Continual Learning of a Transformer-Based Deep Learning Classifier Using an Initial Model from Action Observation EEG Data to Online Motor Imagery Classification

The motor imagery (MI)-based brain computer interface (BCI) is an intuitive interface that enables users to communicate with external environments through their minds. However, current MI-BCI systems ask naïve subjects to perform unfamiliar MI tasks with simple textual instruction or a visual/auditory cue. The unclear instruction for MI execution not only results in large inter-subject variability in the measured EEG patterns but also causes difficulty in grouping cross-subject data for big-data training. In this study, we designed a BCI training method in a virtual reality (VR) environment. Subjects wore a head-mounted device (HMD) and executed action observation (AO) concurrently with MI (i.e., AO + MI) in VR environments. EEG signals recorded in the AO + MI task were used to train an initial model, and the initial model was continually improved by the provision of EEG data in the following BCI training sessions. We recruited five healthy subjects, and each subject was requested to participate in three kinds of tasks: an AO + MI task, an MI task, and three sessions of MI with visual feedback (MI-FB). This study adopted a transformer-based spatial-temporal network (TSTN) to decode the user's MI intentions. In contrast to other convolutional neural network (CNN) or recurrent neural network (RNN) approaches, the TSTN extracts spatial and temporal features, and applies attention mechanisms along the spatial and temporal dimensions to perceive the global dependencies. The mean detection accuracies of the TSTN were 0.63, 0.68, 0.75, and 0.77 in the MI, first MI-FB, second MI-FB, and third MI-FB sessions, respectively. This study demonstrated that AO + MI provides an easier way for subjects to conform their imagery actions to the task design, and that BCI performance improves with the continual learning of the MI-FB training process.


Introduction
Brain computer interface (BCI) is the technology for people to directly communicate with external devices through the acquisition and translation of their brain activities. A BCI system measures the neural activities from the central nervous system (CNS) and converts them into artificial outputs in order to replace, restore, enhance, supplement, or improve natural CNS output [1]. Several BCI systems have been developed based on different brain imaging modalities, such as electroencephalography (EEG), magnetoencephalography (MEG), functional magnetic resonance imaging (fMRI), and functional near-infrared spectroscopy (fNIRS) [2]. In contrast to fMRI and fNIRS, which measure the slow changes of hemodynamic responses, the EEG/MEG records the fast changes of electrophysiological activities. A previous study compared the performances of auditory feedback training and visual feedback training in an MI-BCI system [23]; no difference in BCI performance was found between the two feedback modalities. Ishihara et al. (2020) reviewed the effectiveness of feedback modalities in 184 BCI studies [24] and found visual feedback to be the most effective approach for neurofeedback in MI-BCI training. Owing to its intuitive, highly immersive sensation and greater flexibility for experimental design, visual feedback is the most widely chosen modality for implementing a BCI system.
Though visual feedback has been demonstrated to be an effective and efficient means of neural training in MI-BCI systems [25], a user's training performance is highly related to the immersive feeling and self-perception [26]. Ono et al. (2013) compared the values of event-related desynchronization (ERD) in the conditions of no feedback, bar feedback, and incongruent/congruent feedback. In the incongruent and congruent feedback conditions, a real hand open/grasp picture was displayed on a screen. They found that participants in the congruent condition achieved higher ERD values and better detection accuracies, and that visual feedback with a real-hand picture induced higher ERD values than bar feedback or no feedback. Achanccaray et al. (2018) designed an MI-based BCI system to control a 3D arm in a VR environment [27]. Ziadeh et al. (2021) [26] studied the influence of embodiment in the VR environment on a subject's BCI performance. They found that both the sense of ownership and the sense of agency can influence MI training performance [28].
Most MI-BCI systems request subjects to participate in multiple experiment sessions, including MI sessions and MI-feedback (MI-FB) sessions, before they are able to operate the MI-BCI with an acceptable detection rate [26,29]. In the MI-FB sessions, participants perform the same MI tasks but receive online classification results via feedback. However, previous research has indicated that a nonnegligible portion of subjects (about 15~30%) were not able to control BCI systems through the aforementioned BCI training process [30]. Vidaurre et al. (2011) hypothesized two reasons why subjects fail to successfully operate an MI-BCI system: first, the lack of a generic pattern for initial BCI training; second, the difficulty of the transition from offline calibration to online feedback. For the first point, the creation of a generic pattern for a classifier setup is difficult since most BCI training lacks a clear instruction for movement imagination, which can result in large inter-subject differences in the induced brain wave patterns. It has also been difficult to obtain a consistent brain wave pattern in previous studies due to the diversity of MI task designs in different BCI systems. For example, some studies requested participants to imagine the simple moving of an object on the screen [22], while others requested a more complicated MI task, such as imagining a serial movement of a hand or a foot. Recent studies have also demonstrated that training efficacy can be enhanced by creating the illusions of ownership and agency [20,28,30], which can be conveniently achieved by wearing a VR head-mounted device (HMD) [21,26,27,31,32]. For the second point, most BCI studies created their initial classifiers from the EEG data of MI tasks with no feedback and tried to modify the initial classifier to fit online MI-FB applications [33]. Nevertheless, the gap between the initial model and the final classifier could be huge, so that subjects usually had to participate in a large amount of MI-FB training until acceptable results were obtained.
In this study, we examine the feasibility of building the initial classifier for a BCI system using the EEG data obtained from an action observation (AO) task. Action observation has been proposed as an effective rehabilitation therapy approach based on the role of the mirror neuron system in motor learning [34]. The mirror neuron system is activated during the observation of an action execution. In contrast to mirror therapy (MT), which requests semi-paralyzed patients to activate the mirror neuron system by watching the reflected image of the moving unaffected side in a mirror [35], AO therapy promotes the formation of motor memory through viewing limb movements, which provides more flexibility in experimental design. Hsieh et al. (2020) compared the rehabilitation outcomes between the interventions of AO therapy and MT in stroke patients [36], and concluded that AO therapy can lead to better improvements in patients. Vogt et al. (2013) reviewed neuroimaging studies and claimed that neural activities in motor cortices can be enhanced by performing concurrent AO and MI [37]. Though the role of AO in motor activities has been studied in the previous literature [27,38,39], most BCI studies utilized AO as a feedback approach for improving a subject's MI performance rather than using the brain activities in AO for constructing an initial BCI classifier. Since a poor setting of the initial BCI classifier can frustrate the subject and impede training efficiency [40], the AO, whose activated brain areas overlap with the brain regions recruited in an MI task [38], could be a possible way to generate similar signal features for the initial setup of a BCI classifier. This study aims to initialize the parameter settings of BCI classifiers based on EEG data collected from the AO plus MI (AO + MI) task. We discuss the feasibility of tuning this initial classifier to achieve better classification results by means of continual learning with MI-FB data. The AO + MI provides a convenient way for subjects to follow the experimental designer's instruction. It also gives an easier way for subjects to conform their imagery actions to the expected task design.
The present study intended to build a training method for MI-BCI based on AO + MI in the VR environment. We aimed to answer the following questions: (1) Is the classifier trained from AO + MI data good enough for MI classification? (2) Can we generate a classifier from AO + MI data and tune the classifier with the continual learning of MI data? (3) Can we adopt a transformer-based deep learning classifier in our BCI training? (4) Can the detection accuracy of a classifier trained from cross-subject AO + MI data be improved by the continual learning of individual MI data? Because AO and MI are motor tasks without actual motor execution that share overlapping motor areas in their activations, we asked whether the large amount of EEG data required from the tedious MI task could be replaced, in an easier way, by EEG data from the AO + MI task.

Subjects and EEG Recordings
Five healthy subjects (all male, all right-handed; mean age = 24 ± 4.2 years) were recruited to participate in our study. All participants were requested to sit in a comfortable armchair in a dimly illuminated, electromagnetically shielded room. EEG signals were recorded using an eight-channel dry-electrode wireless EEG system (bandpass 0.05-250 Hz; 55-65 Hz bandstop; 24-bit data resolution; digitized at 1 kHz; InMex EEG system, WellFulfill Co., Taoyuan, Taiwan). The EEG electrodes were placed in accordance with the international 10-20 system. Electrodes were located at the FC1, FC2, C3, Cz, C4, P1, and P2 positions, with respect to a reference electrode placed at the right mastoid and a ground electrode placed at the left mastoid. Each EEG dry electrode was constructed from 10 spring-loaded copper pins, and its biocompatibility was certified according to ISO 10993. Immersive scenarios were created in the virtual reality environment using the Unity 3D Engine (Unity Technologies Inc., San Francisco, CA, USA) and presented to the participants through a head-mounted display system (HTC VIVE, HTC Co., Taipei, Taiwan) (see Figure 1).

Experimental Task
Participants took part in our experiment over four weeks, one session per week. In the first week, subjects were requested to participate in an AO + MI session, and they were then asked to join three MI-FB sessions in the following three weeks. The VR environment was programmed using Unity 3D and displayed on a head-mounted display (HMD) device to provide an immersive experience for the subjects. Subjects were asked to imagine they were looking into a mirror and that the virtual character was the reflection image of themselves. In the AO + MI session, each trial contained two AO + MI time blocks and one MI time block (without feedback). In each trial, the designated hand, either the left or the right hand, was chosen randomly. During the AO + MI time block, the virtual character raised its designated hand once, and subjects were requested to perform an MI movement of the same hand concurrently with the action of the virtual character. During the MI time block, subjects were asked to perform an MI movement of the same designated hand just after they saw the presence of the direction sign (see Figure 1). Subjects were requested to perform 300 trials in the AO + MI session, 150 trials for the left hand and 150 trials for the right hand, arranged in a random order. After every twenty trials, subjects had a five-minute break to prevent mental fatigue. Each AO + MI time block was the concatenation of a resting period (5~7 s) and an AO + MI period (2 s). In the resting period of the AO + MI time block, the virtual character was kept still with no movement. In the AO + MI period, the virtual character performed a front arm raise of the designated hand once. Subjects were instructed to imagine that the virtual character was the reflection image of themselves, similar to looking into a mirror. In the MI time block, subjects were requested to perform a mental rehearsal of the arm raise action which they had viewed in the previous AO + MI time block. One resting period (5~7 s) was concatenated with an MI period (2 s). In the resting period of the MI time block, a cross sign was presented, and subjects were asked to relax without movement intention. In the MI period, a direction sign was presented, and subjects were requested to perform a mental rehearsal of the arm raise action of the designated arm which they had viewed in the AO + MI time block.

After the AO + MI session, subjects were requested to participate in three MI-FB sessions in the following three weeks, one session per week. Each MI-FB session contained 150 trials of MI movements (with visual feedback): one-third of the trials (50 trials) for left-hand imagery movements, one-third (50 trials) for right-hand imagery movements, and one-third (50 trials) for no movement, arranged in a random order. In each trial, a preparation sign was presented before the movement indication sign. Subjects were initially instructed to keep themselves relaxed in a preparation time block (5~7 s), and then a left or right direction sign was presented to instruct subjects which hand should be used in the MI movement time block (3 s). For each MI movement, subjects were requested to perform a mental rehearsal of the arm raise action of the designated hand that they had viewed in the AO + MI session. The EEG signals in each MI time block were classified into one of three classes (i.e., left-arm movement, right-arm movement, and the resting state), and the classification result was fed back to the subject by driving the virtual character to show the arm raise movement of the classified result in the feedback time block (2 s), accompanied by a circle or cross sign indicating the correctness of the MI classification in that trial. The experimental paradigms of the AO + MI and MI-FB sessions are shown in Figure 2a and Figure 2b, respectively.
All participants gave informed consent, and the study was approved by the Ethics Committee of the Institutional Review Board (IRB) (TYGH 107055), Tao-Yuan General Hospital, Taiwan. All measurements were noninvasive, and the subjects were free to withdraw at any time.


Transformer-Based Spatial-Temporal Network (TSTN) for MI Classification
In this study, we adopted the network proposed by Song et al. (2021) [41], which applies attention mechanisms to the spatial and temporal features of EEG data. The eight-channel EEG signals were prefiltered within 4~40 Hz (3rd-order Butterworth IIR filter) and then downsampled to 250 Hz. The EEG data were then segmented into two-second epochs and spatially filtered using a common spatial pattern (CSP) to extract the discriminant features. The spatial features in the feature-channel data were further enhanced by applying spatial transforming with an attention mechanism. The enhanced feature-channel data were then segmented along time into embedded patches using two convolution layers. The relationships among the different temporal patches were perceived using multi-head transforming to obtain distinguishable representations. The features processed by the temporal transformer were averaged, and a fully connected layer was used to classify the feature-segment representations into one of three categories (i.e., left-arm movement, right-arm movement, and the resting state). The detailed signal processing is described as follows.
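For concreteness, the preprocessing described above can be sketched in a few lines of Python. The function below is a minimal illustration (not the authors' code), assuming SciPy's Butterworth/filtfilt routines and a simple decimation from 1 kHz to 250 Hz; since the passband ends at 40 Hz, well below the new Nyquist frequency of 125 Hz, the decimation needs no extra anti-aliasing filter.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(eeg, fs=1000, fs_out=250, band=(4.0, 40.0)):
    """Band-pass filter raw EEG (channels x samples) and downsample it,
    as a sketch of the preprocessing described above."""
    b, a = butter(3, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    filtered = filtfilt(b, a, eeg, axis=-1)  # zero-phase 3rd-order Butterworth
    step = fs // fs_out                      # 1000 Hz -> 250 Hz: keep every 4th sample
    return filtered[:, ::step]

# Example: one 2 s epoch of 8-channel EEG sampled at 1 kHz -> shape (8, 500)
epoch = preprocess(np.random.randn(8, 2000))
```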

Spatial Filtering Using Common Spatial Pattern (CSP)
CSP is an effective feature extraction method which maximizes the discriminability of two classes by constructing a set of optimized spatial filters [41]. In our study, the CSP was constructed based on a one-versus-rest (OVR) strategy (i.e., rest vs. left/right, left vs. rest/right, and right vs. rest/left) to extract feature-channel signals. For each CSP, the covariance matrices $R_1$ and $R_2$ were calculated from two sets of EEG signals $X_1$ and $X_2$, in which $C$ is the number of EEG channels ($C = 8$), $T$ is the length of the sampled data ($T = 500$ samples for a two-second epoch at 250 Hz), and $X_1 \in \mathbb{R}^{C \times T}$ and $X_2 \in \mathbb{R}^{C \times T}$ are the EEG data of the two classes that we want to discriminate. The CSP seeks a matrix of eigenvectors $P = [p_1, p_2, \ldots, p_C]$ that simultaneously diagonalizes the covariance matrices $R_1$ and $R_2$, which can be represented as:

$$P^{\top} R_1 P = D \quad \text{and} \quad P^{\top} (R_1 + R_2) P = I,$$

in which $D$ is the diagonal matrix with eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_C\}$ sorted in descending order and $I$ is the identity matrix. The first two columns $p_1$ and $p_2$ of $P$ were chosen, and the transposed column vectors $p_1^{\top}$ and $p_2^{\top}$ from the subfilters of the three CSPs were stacked as the spatial filter $Z$ ($Z \in \mathbb{R}^{6 \times C}$). For an EEG data matrix $X$ ($X \in \mathbb{R}^{C \times T}$), the filtered EEG data $S$ can be represented as:

$$S = ZX,$$

where $S \in \mathbb{R}^{6 \times T}$ contains the feature-channel signals filtered by the six subfilters obtained from the three discriminations, with two subfilters for each discrimination.
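The OVR CSP construction can be sketched as follows. This is a minimal NumPy/SciPy illustration under the definitions above; the function name `csp_pair` and the trace normalization of the covariances are our choices, and the final stacking of $Z$ is left as a comment.

```python
import numpy as np
from scipy.linalg import eigh

def csp_pair(X1, X2):
    """Two-class CSP: return the two most discriminative spatial filters.
    X1, X2: (trials, channels, samples) arrays for the two classes."""
    def mean_cov(X):
        # trace-normalized covariance averaged over trials
        return np.mean([x @ x.T / np.trace(x @ x.T) for x in X], axis=0)
    R1, R2 = mean_cov(X1), mean_cov(X2)
    # Generalized eigenproblem R1 p = lambda (R1 + R2) p: the eigenvectors
    # simultaneously diagonalize R1 and R1 + R2, as in the equations above.
    vals, vecs = eigh(R1, R1 + R2)
    P = vecs[:, np.argsort(vals)[::-1]]   # columns sorted by descending eigenvalue
    return P[:, :2].T                     # (2, channels) subfilter

# OVR: stack two subfilters per discrimination into Z (6 x C), then S = Z @ X
# Z = np.vstack([csp_pair(X_rest, X_not_rest),
#                csp_pair(X_left, X_not_left),
#                csp_pair(X_right, X_not_right)])
```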

Spatial Transforming for the Enhancement of Feature-Channel Signals
The feature information in the feature-channel data $S$ was enhanced by applying the self-attention mechanism. The input vectors were multiplied with three weight matrices $W_q$, $W_k$, and $W_v$ to obtain the query, key, and value vectors for each input vector, represented as:

$$q_i = W_q a_i, \quad k_i = W_k a_i, \quad v_i = W_v a_i,$$

in which $a_i$ is the $i$th input vector, and $q_i$, $k_i$, and $v_i$ are the query, key, and value vectors. The similarity between $q_i$ and $k_j$ is then estimated by calculating the scaled dot-product:

$$S(q_i, k_j) = \frac{q_i \cdot k_j}{\sqrt{d}},$$

where $d$ ($d = 500$) is the dimension of $k_j$. The value of $S(q_i, k_j)$ is then normalized using the softmax function to obtain $\alpha_{i,j}$ and multiplied by the vector $v_j$ to obtain the $i$th output vector $b_i$. The calculation can be represented as:

$$b_i = \sum_j \alpha_{i,j} v_j, \quad \text{with} \quad \alpha_{i,j} = \frac{\exp\big(S(q_i, k_j)\big)}{\sum_{j'} \exp\big(S(q_i, k_{j'})\big)}.$$

The whole process can be represented in matrix form as:

$$B = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$$

where the output matrix $B$ contains the enhanced feature-channel data.
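A minimal PyTorch sketch of this spatial self-attention is given below, assuming the six CSP feature channels act as tokens and each channel's $d = 500$ samples act as the embedding dimension. It is an illustration of the mechanism, not the reference implementation.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Scaled dot-product self-attention over the 6 CSP feature channels."""
    def __init__(self, d=500):
        super().__init__()
        self.Wq = nn.Linear(d, d, bias=False)  # query projection
        self.Wk = nn.Linear(d, d, bias=False)  # key projection
        self.Wv = nn.Linear(d, d, bias=False)  # value projection
        self.scale = d ** 0.5

    def forward(self, S):                      # S: (batch, 6, d)
        Q, K, V = self.Wq(S), self.Wk(S), self.Wv(S)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ V                        # enhanced feature-channel data B

B = SpatialSelfAttention()(torch.randn(50, 6, 500))  # a batch of 50 trials
```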

Patch Embedding of Feature-Channel Signals
In order to reduce the computation complexity, the feature-channel data were segmented into data patches, and the dependencies among different patches in a trial were learned using a temporal transforming network (see below). The enhanced feature-channel data $B$ were passed through a one-dimensional convolution layer (number of filters = 2; kernel size = 51) and then processed by a two-dimensional convolution layer (number of filters = 10; kernel size = 6 × 5) to create 10 signal patches, in which each patch has a data length of 90 samples. The 10 patches were used as inputs for the following temporal transforming network.
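The text fixes only the filter counts and kernel sizes, so the sketch below fills in strides and padding (our assumptions) chosen so that the output matches the stated 10 patches of 90 samples each.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn enhanced feature-channel data (6 x 500) into 10 embedded
    patches of length 90. Strides/padding are our assumptions."""
    def __init__(self):
        super().__init__()
        # temporal convolution (2 filters, kernel 51) applied per channel
        self.conv1 = nn.Conv2d(1, 2, kernel_size=(1, 51))                 # -> (2, 6, 450)
        # 2-D convolution (10 filters, kernel 6 x 5) collapses the 6 channels
        self.conv2 = nn.Conv2d(2, 10, kernel_size=(6, 5), stride=(1, 5))  # -> (10, 1, 90)

    def forward(self, B):                    # B: (batch, 6, 500)
        x = self.conv2(self.conv1(B.unsqueeze(1)))
        return x.squeeze(2)                  # (batch, 10, 90): 10 patches of 90 samples

patches = PatchEmbedding()(torch.randn(50, 6, 500))  # torch.Size([50, 10, 90])
```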

Temporal Transforming for Embedded Patches
Multi-head attention (MHA) was employed to explore the dependencies among different patches in our temporal transforming step. The MHA allows the network to attend to different parts of the sequence differently. The embedded patches (10 signal patches) were divided into five data sets $E_i$ ($E_i \in \mathbb{R}^{2 \times 90}$; $i = 1, \ldots, 5$) and passed through $h$ ($h = 5$) independent attention networks. The outputs of these attention networks were concatenated and combined with a final weight matrix $W_o$. The MHA can be represented as follows:

$$F = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_h\big) W_o, \quad \mathrm{head}_i = \mathrm{Attention}(E_i),$$

where $F$ ($F \in \mathbb{R}^{10 \times 90}$) is the output of the multi-head attention, $\mathrm{Concat}(\cdot)$ is the function that concatenates the outputs of the heads, and $\mathrm{Attention}(\cdot)$ is the scaled dot-product attention described above.
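The grouping-based MHA described above can be sketched as follows. The `Head` and `TemporalMHA` names are ours, and the sketch reflects our reading of the text (patches split into groups, one independent head per group) rather than the reference code.

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """One scaled dot-product attention head (same form as the spatial step)."""
    def __init__(self, d):
        super().__init__()
        self.Wq, self.Wk, self.Wv = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.scale = d ** 0.5

    def forward(self, E):                    # E: (batch, 2, d) patch group
        Q, K, V = self.Wq(E), self.Wk(E), self.Wv(E)
        return torch.softmax(Q @ K.transpose(-2, -1) / self.scale, dim=-1) @ V

class TemporalMHA(nn.Module):
    """Split the 10 patches into h = 5 groups of 2, run one independent
    head per group, concatenate, and mix with a final weight matrix W_o."""
    def __init__(self, h=5, d=90):
        super().__init__()
        self.h = h
        self.heads = nn.ModuleList(Head(d) for _ in range(h))
        self.Wo = nn.Linear(d, d, bias=False)

    def forward(self, patches):              # patches: (batch, 10, 90)
        groups = patches.chunk(self.h, dim=1)  # five E_i of shape (batch, 2, 90)
        out = torch.cat([head(E) for head, E in zip(self.heads, groups)], dim=1)
        return self.Wo(out)                  # F: (batch, 10, 90)

F = TemporalMHA()(torch.randn(50, 10, 90))
```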

Classifier
The output of the multi-head attention $F$ was then passed to an average pooling layer and normalized, and the classification of imagery movements was achieved by a fully connected layer, in which cross-entropy was chosen as the loss function. The architecture of the TSTN network is shown in Figure 3.
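A minimal sketch of this classification head follows; the use of LayerNorm for the normalization step is our assumption, as the text does not specify the normalization type.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Average-pool the temporal-transformer output, normalize it, and
    classify into three classes (left arm, right arm, rest)."""
    def __init__(self, d=90, n_classes=3):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.fc = nn.Linear(d, n_classes)

    def forward(self, F):                  # F: (batch, 10, 90)
        pooled = self.norm(F.mean(dim=1))  # average over the 10 patches
        return self.fc(pooled)             # class logits

logits = ClassifierHead()(torch.randn(50, 10, 90))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (50,)))
```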

Training of the TSTN Classifier
It has been demonstrated that both AO and MI have significant effects on the modulation of sensorimotor Mu rhythms [42]. The clinical literature in rehabilitation studies found that AO + MI is an effective way to restore motor function in stroke patients [38]. Since AO and MI activate overlapping motor-related brain areas, it is reasonable to construct an initial classifier based on AO + MI data and then use MI-feedback data to continually train the network for better MI classification results. Figure 4 shows the training and testing procedure of our TSTN classifier. The classifier was initially trained using all the AO + MI data (900 trials for each subject; 300 trials for each class). The classifier obtained from the AO + MI data, denoted as TSTN_AO+MI, was used to test the classification performance in the MI (without feedback) task. Because inaccurate trials in MI training could impede the user's BCI performance [43], only those trials in the MI data (without feedback) which were correctly identified were chosen for the continual learning to obtain TSTN_MI. The TSTN_MI was then applied to test the classification performance on the 1st MI-FB data (with VR feedback), and the correctly classified trials were collected to train a new classifier TSTN_MI-FB_1. The TSTN_MI-FB_1 was again used to test the classification performance on the 2nd MI-FB data (with VR feedback), and the correctly classified trials were used to train a final classifier TSTN_MI-FB_2. The final TSTN_MI-FB_2 was applied to test the BCI performance on the 3rd MI-FB dataset, and the test performances among the different steps were compared to see how the BCI training improved the classifier performance. For the model training, Adam with a learning rate of 0.0002 was utilized, and the batch size was set to 50. The dropout rate was 0.3 in spatial transforming and 0.5 in temporal transforming [40,44]. Ten-fold cross-validation was applied to evaluate the final results, with each fit performed on a training set consisting of 90% of the total training set selected at random and the remaining 10% used as a hold-out set for validation.

Figure 3. The architecture of the TSTN network. The EEG signals were prefiltered within 4~40 Hz and then downsampled to 250 Hz. The feature-channel signals were extracted using CSP and further enhanced by applying spatial transforming with an attention mechanism. The enhanced feature-channel data were segmented into ten embedded patches, and the relationships among the different temporal patches were perceived using multi-head transforming to obtain distinguishable representations.
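To make the continual learning procedure concrete, the sketch below implements one step of our reading of Figure 4: test the current model on a new session, keep only the correctly classified trials, and fine-tune on them. The number of fine-tuning epochs is our assumption; the Adam learning rate and batch size follow the text.

```python
import torch
import torch.nn as nn

def continual_step(model, X, y, lr=2e-4, batch_size=50, epochs=10):
    """One continual-learning step: evaluate on the new session, keep the
    correctly classified trials, fine-tune on them, return test accuracy."""
    model.eval()
    with torch.no_grad():
        correct = model(X).argmax(dim=1) == y   # test on the new session
    acc = correct.float().mean().item()
    X_keep, y_keep = X[correct], y[correct]     # correctly classified trials only

    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):                     # fine-tune on the kept trials
        for i in range(0, len(X_keep), batch_size):
            opt.zero_grad()
            loss = loss_fn(model(X_keep[i:i + batch_size]), y_keep[i:i + batch_size])
            loss.backward()
            opt.step()
    return acc

# Hypothetical usage, session by session:
# acc_MI  = continual_step(tstn, X_MI,  y_MI)   # TSTN_AO+MI -> TSTN_MI
# acc_FB1 = continual_step(tstn, X_FB1, y_FB1)  # TSTN_MI    -> TSTN_MI-FB_1
```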


Comparing the Detection Performance with Other Classifiers
In order to compare the effectiveness of the TSTN with other classifiers, two convolutional neural network (CNN)-based classifiers, the EEGNet [45] and the DeepConvNet [46], and three support vector machines (SVMs) with linear, radial basis, and polynomial kernel functions were tested and compared with the TSTN. The training and testing procedures were the same as those shown in Figure 4.
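As an illustration of the SVM baselines, the following scikit-learn sketch trains the three kernels on flattened feature epochs; the flattening choice and the function name are ours, not a description of the authors' pipeline.

```python
from sklearn.svm import SVC

def evaluate_svms(X_train, y_train, X_test, y_test):
    """Train/test SVMs with the three kernels on the same session splits
    as the TSTN (Figure 4). X_*: (trials, channels, samples) arrays."""
    results = {}
    for kernel in ("linear", "rbf", "poly"):
        clf = SVC(kernel=kernel).fit(X_train.reshape(len(X_train), -1), y_train)
        results[kernel] = clf.score(X_test.reshape(len(X_test), -1), y_test)
    return results
```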

Results
In this study, we built the initial classifier TSTN_AO+MI using the AO + MI data and then performed continual learning with the chosen trials in the MI, first MI-FB, and second MI-FB datasets to obtain TSTN_MI, TSTN_MI-FB_1, and TSTN_MI-FB_2, respectively. Table 1 shows the classification results of using the TSTN in our five participants. The test targets for the TSTN_AO+MI, TSTN_MI, TSTN_MI-FB_1, and TSTN_MI-FB_2 were the MI (without feedback), first MI-FB, second MI-FB, and third MI-FB datasets, respectively. The classification accuracies increased after each continual learning step: the classification accuracies averaged over the five participants were 0.63, 0.68, 0.75, and 0.77 for the TSTN_AO+MI, TSTN_MI, TSTN_MI-FB_1, and TSTN_MI-FB_2, respectively. The corresponding specificities were 0.81, 0.84, 0.87, and 0.89, and the F1 scores averaged over the five subjects were 0.63, 0.68, 0.75, and 0.77, respectively. It is worth noting that the initial TSTN_AO+MI had already achieved an accuracy higher than 60%, and the detection accuracies kept increasing with the classifier tuning in each continual learning step. This demonstrates the feasibility of using AO + MI data to create an initial classifier model for MI classification.

Figure 5 plots the detection accuracies of the three MI classes in the five subjects obtained by applying the TSTN_AO+MI, TSTN_MI, TSTN_MI-FB_1, and TSTN_MI-FB_2 to the MI (without feedback), first MI-FB, second MI-FB, and third MI-FB datasets, respectively. The detection accuracies for S1 to S5 are shown in Figure 5a-e, and the averages over the five subjects are shown in Figure 5f. The detection accuracies of the individual TSTN_AO+MI on the MI dataset were 0.74, 0.58, 0.61, 0.61, and 0.60 for S1, S2, S3, S4, and S5, respectively. For the individual TSTN_MI on the first MI-FB data, the detection accuracies were 0.73, 0.66, 0.62, 0.65, and 0.73, respectively. For the individual TSTN_MI-FB_1 on the second MI-FB data, the detection accuracies were 0.78, 0.72, 0.73, 0.73, and 0.78, respectively. For the individual TSTN_MI-FB_2 on the third MI-FB data, the detection accuracies were 0.83, 0.73, 0.75, 0.73, and 0.80, respectively. In addition to the models built from individual data, we were interested in testing whether the data collected from different individuals could be pooled to train a BCI classifier.
The TSTN_AO+MI_all, TSTN_MI_all, TSTN_MI-FB_1_all, and TSTN_MI-FB_2_all were obtained following the training procedure described in Figure 4, except that the measured data over the five subjects were pooled for classifier training. The classification accuracies, specificities, and F1 scores in the five subjects are listed in Table 2. Figure 6a-e presents the detection accuracies using the classifiers trained from the pooled data of the five subjects, and the averaged accuracies over the five subjects are shown in Figure 6f. It can be observed that the detection accuracies increased with the continual training, which might indicate the feasibility of pooling subjects' data to obtain a generally applicable model. The detection accuracies of TSTN_AO+MI_all on the MI data were 0.65, 0.59, 0.62, 0.63, and 0.58 for S1, S2, S3, S4, and S5, respectively. For TSTN_MI_all, the detection accuracies on the first MI-FB data were 0.66, 0.65, 0.62, 0.63, and 0.61, respectively. For TSTN_MI-FB_1_all, the detection accuracies on the second MI-FB data were 0.71, 0.65, 0.70, 0.68, and 0.65, respectively. For TSTN_MI-FB_2_all, the detection accuracies on the third MI-FB data were 0.73, 0.67, 0.71, 0.69, and 0.68, respectively.

To demonstrate the effectiveness of the TSTN network in this study, the EEGNet [45], the DeepConvNet [46], and three SVMs with linear, polynomial, and radial basis kernel functions were compared. All the classifiers were built according to the training procedure shown in Figure 4. The detection accuracies obtained from the TSTN network were compared with those of the five classifiers and are listed in Table 3. It can be observed that all the classifiers showed increased detection accuracies in our continual learning process. The three deep learning networks (i.e., the TSTN, EEGNet, and DeepConvNet) showed comparable detection accuracies on the third MI-FB data, which were 0.77, 0.74, and 0.75, respectively.
For the initial model, the TSTN_AO+MI had the highest detection accuracy (0.63) on the MI data compared to the other classifiers, despite the substantial differences between the AO + MI and MI/MI-FB datasets: the AO + MI task activated both the neural circuitry of action observation and that of motor imagery, while the MI/MI-FB tasks activated the neural circuitry of motor imagery only. Because the transformer has a greater ability to model long-distance dependencies among the temporal embedded patches, it could provide more possibilities to find the dependencies between different datasets.

Discussion
This study aimed to examine the feasibility of using EEG data induced by AO + MI to build an initial model for an MI-BCI, a model which could then be continually improved with the provision of MI data in continual learning. In previous BCIs [47], subjects were requested to perform imagery movements given only a simple visual or auditory cue. The lack of clear instructions makes it difficult to standardize the way subjects perform imagery movements. Therefore, if we can give subjects clear instructions for performing imagery movements, we might have a chance to reduce interindividual differences and obtain generic brain wave patterns for training BCI classifiers. Alimardani et al. (2022) surveyed the impacts of different demographic and psychological variables on the performance of an imagery movement BCI in 54 subjects [48]. They suggested that prior BCI training and clear instructions for the experimental protocol are two of the most important factors influencing a subject's BCI performance. In our previous study [49], the lack of explicit MI instructions made the task difficult for subjects to follow. Subjects may try to develop their own strategies to achieve better BCI accuracies. The distinct strategies in imagery movements could result in large interindividual variations in the induced brain wave patterns, which makes the group analysis of MI data difficult.
In our research, we used the data obtained from the AO + MI task to create an initial model for MI-BCI control. The AO + MI task requested users to passively watch the action of the character in the VR environment and concurrently imagine themselves performing the same movement. Cengiz et al. (2018) [50] studied ERD in the sensorimotor Mu rhythm when subjects were performing an AO task. Zhang et al. (2018) also found that AO can promote the effect of motor relearning [51], and the AO task can be performed concurrently with MI (AO + MI) to achieve more effective motor learning in a rehabilitation setting [38]. In our experiment, we asked the subjects to perform imagery movements identical to the actions of the virtual character, in order to achieve a better motor learning effect. It has been reported in the previous literature that the AO + MI task can be performed either from the first-person visual perspective [52] or from the third-person visual perspective [53,54]. Some studies have claimed that the motor activities in AO + MI from the third-person visual perspective might involve rotation actions to transform the third-person visual perspective into a first-person imagery perspective [54]. However, it has been reported that the rotation and transforming effects can be removed with the provision of a clear instruction, by asking subjects to imagine that the observed movements in the third-person perspective are similar to viewing images in a mirror [38]. In our study, we instructed subjects to imagine that the virtual character was the reflection image of themselves in a mirror. The imagery actions in the following MI and MI-FB tasks were the same as the actions they viewed in the AO task; therefore, subjects recalled the images of the virtual character's actions in their MI and MI-FB tasks. Since the imagery movements in the AO + MI, MI, and MI-FB tasks were the same, this could be the reason why the detection accuracy was able to improve in the continual learning process.
In this paper, we adopted the transformer-based deep learning network proposed by Song et al. (2021) [40] to classify the EEG signals. Traditional machine learning assumes the data points are distributed independently and identically; however, in many cases, the acquisition of these data is related to a subject's physiological state (e.g., cognitive state, neural learning, language learning, emotion, or electrocardiogram (ECG)). Artificial neural network-based approaches usually have better flexibility to adapt themselves to the learning of sequential data [55]. There are three salient features in the use of the TSTN. First, the TSTN utilizes CSP to create discriminable features by designing a set of spatial filters. Since distinct brain areas are responsible for different brain functions [56], the use of CSP can create a specific spatial filter to extract the brain activities induced by a particular task. Second, the TSTN utilizes self-attention to enhance the feature-channel data extracted from CSP. In contrast to other popular approaches using convolutional neural networks (CNNs) [45,46], the classification performance of a CNN depends on the selection of kernels, in which small convolutional kernels capture fine, high-resolution details and large convolutional kernels capture coarse, low-resolution structure. The TSTN does not have to decide the kernel size; it applies self-attention to the feature-channel data in order to weight those channels which are relevant to the subject's task. Moreover, the CNN does not consider the dependency of time-series information. Third, the TSTN slices the enhanced feature-channel data into signal patches and uses multi-head transforming to perceive the dependencies along the temporal dimension. Compared to the previous literature using recurrent neural networks (RNNs) or long short-term memory networks (LSTMs) [57], both RNNs and LSTMs suffer from vanishing gradients [58], so RNN- or LSTM-based solutions might have problems dealing with EEG data induced by an MI task with a longer execution time. For example, stroke patients have slower information processing in moving the affected hand and require a longer execution time for performing mental tasks [59]. One noteworthy disadvantage of the transformer-based network is its large model size. The TSTN requires considerable computing power and training time to achieve good accuracy, so implementation of the proposed model on wearable devices might be difficult.
In order to show the model interpretability of the TSTN neural network, we examined the correctly classified trials that were used to improve the classifiers in the continual learning processes. The relative power change of each correctly classified trial was calculated using the ERD/ERS technique formulated by Pfurtscheller and Lopes da Silva (1999) [60]. The ERD/ERS describes the power decrease/increase within a specific frequency band anchored to a given event, calculated relative to the power of a reference period. The ERD/ERS represents the percentage of power change with respect to the reference period, in which the relative power change can be represented as follows:

$$RP\% = \frac{A - R}{R} \times 100\%, \tag{11}$$

where $RP\%$ is the percentage of the relative power change within the alpha band (8~13 Hz) [61], $A$ is the event-induced signal power, and $R$ is the power in the reference period. In Equation (11), $A$ is calculated from the period of the AO + MI or MI time block, and the reference power $R$ is calculated from the resting period, from −2 s to 0 s, preceding the action of the virtual character or the presence of the visual cue for the imagery movement. Figure 7 shows the RPs obtained from the average of the five subjects in the four datasets. In the MI and MI-FB datasets, only the RPs of the correctly classified trials were calculated. The RPs of the AO + MI tasks in viewing the left-hand and right-hand movements are shown in Figure 7a and Figure 7b, respectively. The RP changes in the left-hand and right-hand MI tasks are shown in Figure 7c and Figure 7d, respectively. The RP changes of the first MI-FB, second MI-FB, and third MI-FB datasets for the left-hand and right-hand movements are shown in Figure 7e,f, Figure 7g,h, and Figure 7i,j, respectively. It can be observed that the ERD (the blue area of each plot in Figure 7) was enhanced with the continual learning process. This demonstrates that the TSTN network can effectively pick up those EEG trials with prominent ERDs. In Figure 7, only the ERD within the alpha band is shown, because the alpha-band ERD has been reported as the most indicative parameter in operating an MI-BCI system [61]. The BCI performance of our TSTN network improved with the continual learning process: not only was the detection accuracy improved (see Tables 1 and 2), but the value of the ERD was also enhanced (see Figure 7). The relationship between the ERD and the classification accuracy echoes the findings in the previous literature that the ERD is an important parameter in MI-BCI training [61]. Nevertheless, we did not see the ERD lateralization mentioned in previous MI studies [62]. This might be due to the difference in experimental design between our study and the previous literature. In our study, we requested subjects to perform AO + MI and to recall the action imagination in the following MI and MI-FB tasks. The memory recall of action imagery in the MI and MI-FB tasks could involve the mirror neuron system, which presents activations in the bilateral primary motor cortex, the primary somatosensory cortex, and the middle frontal cortex [51]. According to the study by Rizzolatti et al. (2010) [63], the bilaterally distributed parietofrontal network in the mirror neuron system serves as a neural substrate to achieve the transformation of visual information into motor execution (i.e., visuomotor transformation).
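Equation (11) is straightforward to compute. The sketch below derives RP% for one channel of one epoch; the alpha band and the 2 s reference/event windows follow the text, while the epoch layout (cue onset time within the epoch) is our assumption.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def rp_percent(signal, fs=250, cue=2.0, band=(8.0, 13.0)):
    """Relative power change, Equation (11): RP% = (A - R) / R * 100.
    `signal` is one EEG channel of an epoch whose cue appears `cue`
    seconds after the epoch start; the 2 s before the cue form the
    reference period R and the 2 s after it the event period A."""
    b, a = butter(3, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    alpha = filtfilt(b, a, signal)                 # alpha-band (8~13 Hz) signal
    i_cue = int(cue * fs)
    R = np.mean(alpha[i_cue - 2 * fs:i_cue] ** 2)  # reference power (-2 s to 0 s)
    A = np.mean(alpha[i_cue:i_cue + 2 * fs] ** 2)  # event-induced power (0 s to 2 s)
    return (A - R) / R * 100.0                     # negative values indicate ERD
```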
Previous BCI studies only gave subjects a simple cue (e.g., a visual or auditory cue) to perform imagery movements [64,65]. Unclear instruction could make it difficult for subjects to follow the experimenter's guidance to achieve the desired task [65]. Since the AO task has been reported as an effective way to activate the human motor cortex with brain wave patterns similar to those induced by the MI task, we found it was easier for subjects to understand the experimenter's instruction when they were requested to watch and follow the virtual character's actions, instead of just being provided with simple textual instruction or visual/auditory cues. To the best of our knowledge, our research is the first study to train BCI classifiers using EEG signals induced by both AO + MI and MI tasks. Current BCI databases (e.g., BNCI Horizon; http://bnci-horizon-2020.eu/database/data-sets, accessed on 12 October 2022) do not collect the EEG signals of both AO + MI and MI tasks for the same subjects. The EEG data collected in this study could inspire future research on the relevance between observation and imagery movements.
One limitation of our current study is the small sample size. Because we wanted to demonstrate the training transition of a BCI classifier from the AO + MI to the MI-FB tasks, each subject had to join one experiment per week, and it took four weeks to complete the data collection for one subject. The small amount of data was not sufficient for an effective comparison of the detection performances among different classifiers (see Table 3). Nevertheless, this paper aimed to study the feasibility of continual learning in a BCI classifier, from AO + MI data to MI-FB data. We observed that the detection accuracies improved in all classifiers, and all the deep learning classifiers (i.e., TSTN, EEGNet, and DeepConvNet) showed performances superior to those of the traditional SVMs. The second limitation of this study is the heavy computation load of our transformer framework. The transformer-based classifier has a large model size, which makes it difficult to cope with fast-updating or fluctuating conditions.

Conclusions
In this study, we designed a BCI training method in a virtual reality (VR) environment. Subjects wore a head-mounted device (HMD) [32] and started MI training from the AO + MI task in VR environments. Unlike other MI-BCI studies which asked naïve subjects to perform unfamiliar MI tasks, our training procedure provided an easier way for subjects to conform their imagery actions to the task design. The AO-oriented training procedure also provided a more flexible design for BCI training: subjects were only requested to follow the movements of the virtual character and simultaneously perform the imagery actions. The utilization of the AO + MI data to build the initial classifier model had several advantages. First, the instruction for AO + MI was much easier for subjects to follow [36] compared to the traditional MI task, which prompts subjects to perform imagery actions with a simple visual/auditory cue. Second, the AO + MI provided a convenient and standardized experimental protocol. Third, the AO + MI elicited increased neural activities in motor-related brain areas relative to the use of AO or MI alone [38]. Fourth, the AO + MI has been proven to be an effective way for motor learning and rehabilitation in clinics [66]. This study has addressed the following issues: (1) the effectiveness of using AO + MI data to build an initial model for MI classification was validated; (2) the use of a continual learning process for the improvement of classifier performance was demonstrated; (3) the feasibility of a transformer-based deep learning model in MI-BCI classification was demonstrated; (4) the interpretability of the proposed TSTN model was shown in the analysis of alpha ERD in the different BCI training steps. Our current study achieved a mean detection accuracy of 77% over the five participants in the three-class classification. In future applications, experimenters will be able to change the actions of the virtual character to train desired imagery actions, which is important for the use of BCI in metaverse applications.