Learning Optimal Time-Frequency-Spatial Features by the CiSSA-CSP Method for Motor Imagery EEG Classification

The common spatial pattern (CSP) is a popular feature extraction method for motor imagery (MI) electroencephalogram (EEG) classification in brain–computer interface (BCI) systems. However, combining temporal and spectral information in CSP-based spatial features remains a challenging issue, which greatly affects the performance of MI-based BCI systems. Here, we propose a novel circulant singular spectrum analysis embedded CSP (CiSSA-CSP) method for learning the optimal time-frequency-spatial features to improve the MI classification accuracy. Specifically, the raw EEG data are first segmented into multiple time segments, and spectrum-specific sub-bands are further derived by CiSSA from each time segment in a set of non-overlapping filter bands. CSP features extracted from all time-frequency segments contain richer time-frequency-spatial information. An experimental study was implemented on a publicly available EEG dataset (BCI Competition III dataset IVa) and a self-collected experimental EEG dataset to validate the effectiveness of the CiSSA-CSP method. Experimental results demonstrate that discriminative and robust features are extracted effectively. Compared with several state-of-the-art methods, the proposed method exhibited optimal accuracies of 96.6% and 95.2% on the public and experimental datasets, respectively, confirming that it is a promising method for improving the performance of MI-based BCIs.


Introduction
Brain-computer interface (BCI) systems build a direct connection between the human brain and external devices, bypassing peripheral nerves and muscles [1]. BCIs not only help disabled patients effectively regain or recover motor function [2], but also have many promising applications for healthy users, such as gaming, car control [3], and fatigue detection [4]. Among BCI systems, motor imagery (MI)-based BCIs are more flexible than other types because they can be driven by voluntary brain activities without external stimulation and can be more intuitive to control [5,6]. During MI, the sensorimotor rhythms are attenuated and then enhanced within a short time, which is known as event-related desynchronization/synchronization (ERD/ERS) [7]. Generally, the signal processing of an MI EEG-based BCI system comprises three stages: EEG signal recording, feature extraction, and classification. Among these, feature extraction is challenging due to non-stationarity and a low signal-to-noise ratio, which can affect the performance of MI-based BCIs [8].
To optimally extract EEG features that describe the ERD/ERS phenomenon, the common spatial pattern (CSP) algorithm, which seeks spatial filters to extract class-discriminative spatial features [9], is frequently adopted due to its good performance. However, the performance of CSP is strongly affected by the frequency bands used for sensorimotor rhythm extraction and the time period selected for analysis in the EEG signal [10]. Combining the spectral and temporal information in the CSP-based spatial features is challenging. The effectiveness of CSP depends on identifying the optimal EEG frequency bands. Because the optimal frequency band is subject-specific, a fixed, broad frequency band, as is commonly used, is not suitable for all cases [11]. The classification accuracy may decrease due to a poorly selected filter band that does not include sufficient spectral information [12]. To solve this issue, several extensions of CSP have been proposed to use narrowband information from different frequency bands, such as sub-band CSP (SBCSP) [13], filter bank CSP (FBCSP) [14], discriminative FBCSP (DFBCSP) [15], and sparse FBCSP (SFBCSP) [16]. These methods usually adopt a finite impulse response (FIR) or infinite impulse response (IIR) filter to obtain sub-bands in different frequency bands, which cannot remove noise and artifacts overlapping in time-frequency space with the MI EEG signal [17]. Such noise and artifacts decrease the classification accuracy of the CSP method.
In addition to frequency band optimization, another significant but often ignored issue in CSP is the optimization of the time window used for EEG segmentation to capture discriminative features [12]. An appropriate time window should be preselected to cover the significant ERD/ERS patterns when the imagery is activated and to exclude the unrelated time interval after the imagery is over. Many previous studies [11,16,18-21] have adopted a fixed, predefined time window (e.g., 0.5-2.5 s after the cue) for feature extraction of MI-related EEG. However, the optimal EEG time window varies over time and across subjects [22]. A fixed time window can hardly capture discriminative temporal features for all subjects, and hence results in poor classification performance. In recent years, an increasing number of researchers have suggested that optimizing the time window can significantly improve classification accuracy. Wang et al. introduced two Parzen window-based methods to select subject-specific time segments from 21 overlapping time window candidates [23]. Huang et al. [24], Miao et al. [6], and Kirar et al. [25] introduced methods that simultaneously optimize time windows and frequency sub-bands within CSP to improve MI classification performance. Jin et al. designed a novel time filter that acts together with the spatial filter to introduce temporal information into the spatial features, and discussed the effect of different time-window lengths to obtain the optimal time segments [26]. Therefore, it is necessary to identify optimal, task-related frequency sub-bands and relevant time segments of EEG data to improve the performance of MI classification.
When the frequency sub-bands and the time window are optimized, the time-frequency-spatial features from multichannel EEG recordings lead to a very high-dimensional feature space. However, the high-dimensional feature space may contain irrelevant features [25] and cause overfitting problems that inevitably diminish discriminative information [24]. These irrelevant features and overfitting decrease the performance of MI classification. In order to obtain a relevant subset of features and reduce the dimensionality of the feature space, a feature fusion method is used to obtain the optimal features. Feature fusion usually comprises feature selection and dimensionality reduction. Feature selection is the process of finding the most effective features from the available feature set to improve algorithm performance. In the MI classification task, commonly used feature selection methods include the L1-norm [18], Fisher score [27], mutual information [28], and neighborhood component analysis (NCA) [21]. Dimensionality reduction is the process of removing redundant variables, leaving the significant variables to improve the accuracy of classification tasks [29]. Principal component analysis (PCA) is a common and popular dimensionality reduction method.
In order to optimize the frequency band and time window for combining spectral and temporal information in the CSP-based spatial features, we propose a novel circulant singular spectrum analysis embedded CSP (CiSSA-CSP) method for learning the optimal time-frequency-spatial features to improve the classification accuracy of MI-related EEG. Prior to extracting features using CSP, the raw EEG data are segmented into multiple sub-time segments, from which spectrum-specific sub-bands are derived in a set of non-overlapping filter bands using CiSSA. The embedded CiSSA not only introduces additional spectral information into the features, but also suppresses noise and artifacts overlapping in the frequency space with the EEG signal. Instead of adopting all the time-frequency-spatial features for classification, we devised a feature fusion step based on mutual information or principal component analysis (PCA) to reduce redundant information and extract optimal CSP features. Thus, the MI classification accuracy is improved by the proposed method.

Methods
The overall framework of the proposed CiSSA-CSP method for motor-imagery classification is illustrated in Figure 1, including time segmentation, sub-band filtering, CSP feature extraction, and feature fusion. Specifically, the multi-channel EEG signals are segmented into T = 4 overlapping segments using a sliding window. Every time segment is bandpass filtered into B = 6 sub-bands in a set of non-overlapping filter bands using the CiSSA method. Then, spatial features are extracted in every sub-band across all time segments by CSP, and the feature vector F_i ∈ R^(2MBT) is obtained for each trial. Finally, a feature fusion method, based on mutual information or PCA, is used to obtain the optimal features, which are then fed into the SVM for MI classification. A simple linear kernel and the regularization constant C = 1 are adopted for SVM training.

Time Segmentation of EEG Signal
MI EEG signals have distinct temporal behavior and a transient nature [22]. CSP features extracted from the whole time period do not carry any temporal information, as the EEG signals are averaged over time to compute the covariance matrix. Therefore, it is crucial to select the optimal time window and focus on the local properties of the EEG. In order to combine the temporal information in the CSP features, the raw EEG signals are segmented into T overlapping segments using a sliding window of length 2 s, which can discriminate different motor imagery stages.
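The sliding-window segmentation above can be sketched in a few lines. The function name and defaults (a 2 s window sliding in 0.5 s steps, giving T = 4 segments of a 3.5 s trial) are illustrative assumptions, not the authors' code:

```python
import numpy as np

def segment_trial(eeg, fs, win_len=2.0, step=0.5, n_segments=4):
    """Slice one trial (channels x samples) into overlapping windows.

    Hypothetical helper: a 2 s window sliding in 0.5 s steps yields
    T = 4 overlapping segments from a 3.5 s motor-imagery trial.
    """
    win = int(win_len * fs)   # window length in samples
    hop = int(step * fs)      # window step in samples
    return [eeg[:, t * hop:t * hop + win] for t in range(n_segments)]
```

For the public dataset (17 channels, 100 Hz, 3.5 s trials), this yields the 0-2 s, 0.5-2.5 s, 1-3 s, and 1.5-3.5 s segments used in the experiments.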

Sub-Band Filtering Using CiSSA
In order to further combine the spectral information in the CSP features, the sub-bands are composed using the CiSSA method to perform bandpass filtering on all time segments. CiSSA is a nonparametric signal extraction method proposed by Bógalo et al. [30]. It is derived from singular spectrum analysis (SSA) and, compared with narrowband filtering methods such as FIR and IIR, can suppress noise and artifacts with overlapping frequencies [17]. It decomposes the signal into a set of reconstructed components (RCs) of known frequencies. CiSSA consists of four steps: embedding, decomposition, diagonal averaging, and grouping.

In the time-delay embedding step, every single-channel EEG time series s = (s_1, s_2, . . . , s_N)^T (superscript T denotes the transpose of a vector) is mapped onto a multidimensional trajectory matrix X using a sliding window with window length L.

In the decomposition step, the trajectory matrix is decomposed into elementary matrices of rank 1 that are associated with different frequencies. To do so, a related circulant matrix C_L is built based on the second-order moments of the time series [30]:

C_L = circ(c_0, c_1, . . . , c_(L−1)), (1)

where circ(•) denotes the circulant matrix whose first row is the given vector, and:

c_m = ((L − m)/L) γ_m + (m/L) γ_(L−m), m = 0, 1, . . . , L − 1, (2)

with γ_m the autocovariance of the time series at lag m. The eigenvalues and eigenvectors of C_L, respectively, are given by [31]:

λ_k ≈ f((k − 1)/L), u_k = L^(−1/2) (1, e^(−i2π(k−1)/L), . . . , e^(−i2π(L−1)(k−1)/L))^H, k = 1, . . . , L, (3)

where f(•) denotes the power spectral density of the signal and H indicates the conjugate transpose of a matrix. The k-th eigenvalue and corresponding eigenvector are associated with the specific frequency given by:

f_k = ((k − 1)/L) f_s, (4)

where f_s is the sampling rate of the EEG signals. Then, in the diagonal averaging step [32], several time series are reconstructed from the elementary matrices; the reconstructed time series are generally called RCs. Thus, the raw EEG signal is decomposed into several RCs of known frequency given by Equation (4). The frequency bandwidth of each RC can be roughly expressed by [33]:

Δf = f_s/L. (5)

As a consequence, the frequency bandwidth of each RC is limited to f_s/L.
Considering the frequency of each RC given by Equation (4), there is no frequency mixing between different RCs.
We perform bandpass filtering on all time segments using the CiSSA method to obtain a set of non-overlapping sub-bands (fb_1, fb_2, . . . , fb_B). These sub-bands are chosen from the frequency range 6-30 Hz with a bandwidth of 4 Hz, i.e., fb_1 = 6-10 Hz, fb_2 = 10-14 Hz, . . ., fb_B = 26-30 Hz, where B = 6. Then, feature extraction is performed on every sub-band using CSP.
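The four CiSSA steps can be sketched compactly: because the eigenvectors of a circulant matrix form the DFT basis, the elementary matrices can be obtained by keeping single FFT bins of the trajectory matrix (paired with their conjugate bins for a real signal) and then Hankelizing back into time series. This is a simplified illustration under that assumption, not the authors' implementation:

```python
import numpy as np

def cissa_rc(x, L):
    """Decompose a 1-D signal into reconstructed components (RCs).

    RC k oscillates at frequency k/L cycles/sample (k * fs / L in Hz),
    matching Equation (4); returns RCs for k = 0 .. L//2.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    K = N - L + 1
    # embedding: L x K trajectory matrix of lagged windows
    X = np.column_stack([x[i:i + L] for i in range(K)])
    # decomposition: project columns onto the circulant (DFT) eigenbasis
    F = np.fft.fft(X, axis=0) / L
    rcs = []
    for k in range(L // 2 + 1):
        Fk = np.zeros_like(F)
        Fk[k] = F[k]
        if 0 < k < L - k:
            Fk[L - k] = F[L - k]          # conjugate bin for a real signal
        Xk = (np.fft.ifft(Fk, axis=0) * L).real   # elementary matrix
        # diagonal averaging (Hankelization) back to a length-N series
        rc = np.zeros(N)
        cnt = np.zeros(N)
        for j in range(K):
            rc[j:j + L] += Xk[:, j]
            cnt[j:j + L] += 1
        rcs.append(rc / cnt)
    return rcs
```

A sub-band such as 6-10 Hz (f_s = 100 Hz, L = 100) is then obtained in the grouping step by summing the RCs with 6 ≤ k·f_s/L < 10, i.e., k = 6, . . . , 9.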

Feature Extraction Using Common Spatial Patterns
Consider two classes of EEG signals X_(i,1), X_(i,2) ∈ R^(C×P) recorded from the i-th trial, where C is the number of channels and P denotes the number of sample points. The spatial covariance matrix Σ_l of class l (l = 1, 2) is given by:

Σ_l = (1/N_l) ∑_(i=1)^(N_l) X_(i,l) X_(i,l)^T, (6)

where N_l is the number of trials in class l. CSP aims at finding linear transforms (spatial filters) that maximize the discrimination between the two classes [16]. In order to achieve maximum separability between the variances of the two classes, the Rayleigh quotient J(w) is introduced:

J(w) = (w^T Σ_1 w)/(w^T Σ_2 w) = ||w^T X_1||_2^2 / ||w^T X_2||_2^2, (7)

where ||•||_2 denotes the l_2-norm and w ∈ R^C is a spatial filter. The maximization of the Rayleigh quotient J(w) can be achieved by solving the generalized eigenvalue problem Σ_1 w = λΣ_2 w. The learned spatial filter matrix W = [w_1, w_2, . . . , w_2M] is obtained by collecting the eigenvectors corresponding to the M largest and M smallest generalized eigenvalues, which represent maximum discrimination between the two classes. The spatially filtered EEG, i.e., the projection Z of the EEG signal X, is then given by Z = W^T X.
The variance-based CSP feature vector is then formed as F = [F_1, F_2, . . . , F_2M] with M = 2, where F_i is given by [11]:

F_i = log( var(Z_i) / ∑_(j=1)^(2M) var(Z_j) ), (8)

where var(Z_i) denotes the variance of the i-th row of Z. CSP is implemented on the segmented and filtered signals in each sub-band to calculate the corresponding features by Equation (8). As a result, 2MBT = 96 features are extracted from each EEG sample.
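The CSP steps above (covariance averaging, Rayleigh-quotient maximization via a generalized eigenproblem, and log-variance features) can be sketched as follows. Trace-normalizing the per-trial covariances and solving Σ1 w = λ(Σ1 + Σ2)w are standard conventions assumed here, not details stated by the paper:

```python
import numpy as np
from scipy.linalg import eigh

def csp_features(trials_1, trials_2, M=2):
    """Learn 2M CSP filters from two lists of (channels x samples)
    trials and return (W, feature_fn) with log-variance features."""
    def mean_cov(trials):
        # average of trace-normalized spatial covariances
        covs = [X @ X.T / np.trace(X @ X.T) for X in trials]
        return np.mean(covs, axis=0)
    S1, S2 = mean_cov(trials_1), mean_cov(trials_2)
    # generalized symmetric eigenproblem: S1 w = mu (S1 + S2) w
    vals, vecs = eigh(S1, S1 + S2)
    order = np.argsort(vals)
    sel = np.concatenate([order[:M], order[-M:]])  # M smallest + M largest
    W = vecs[:, sel]                               # (channels x 2M) filters
    def features(X):
        Z = W.T @ X                                # spatially filtered EEG
        v = np.var(Z, axis=1)
        return np.log(v / v.sum())                 # log-variance features
    return W, features
```

Applying this per sub-band and per time segment, then concatenating, would give the 2MBT-dimensional feature vector described above.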

Feature Fusion
The method described above leads to a high-dimensional feature set (dimension = 96) that is highly correlated. Using all features for the final decision is inefficient due to overfitting problems in high dimensions [13]. Therefore, feature fusion steps are needed to reduce the feature dimensionality and improve the classification performance. We studied two common approaches to obtain a lower-dimensional feature subset for final classification, namely, mutual information for feature selection and PCA for dimensionality reduction. We feed the reduced feature space to the support vector machine (SVM) and investigate the EEG classification performance.
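The final classification stage can be sketched with a linear-kernel SVM and C = 1, as specified in the framework, evaluated by 10-fold cross-validation as in the paper's protocol. The synthetic matrix below merely stands in for a real fused CSP feature set:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a fused feature matrix (n_trials x k features):
# two classes with shifted means, k = 9 features as in the later example.
rng = np.random.default_rng(0)
F_fused = np.vstack([rng.normal(0.0, 1.0, (50, 9)),
                     rng.normal(1.0, 1.0, (50, 9))])
y = np.array([0] * 50 + [1] * 50)

# Linear-kernel SVM with C = 1, scored by 10-fold cross-validation.
clf = SVC(kernel="linear", C=1)
acc = cross_val_score(clf, F_fused, y, cv=10).mean()
```

With real features, `F_fused` would be the output of the MIBIF or PCA step described below.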

Mutual Information
The mutual information-based individual feature (MIBIF) algorithm is a feature selection method that shows good performance in CSP-based methods [34]. For the feature vector set F = {F_1, F_2, . . . , F_d}, d = 2MBT, and the corresponding class labels Ω = {1, 2}, the mutual information of each feature is calculated as:

I(F_j; Ω) = H(Ω) − H(Ω|F_j), j = 1, . . . , d, (9)

where H(Ω) = −∑_(ω=1)^2 p(ω) log_2 p(ω), ω ∈ Ω, is the entropy of the class labels, and the conditional entropy is:

H(Ω|F_j) = −∑_(ω=1)^2 p(ω|F_j) log_2 p(ω|F_j). (10)

A higher mutual information indicates more relevance between the feature and the class. Thus, the features are ranked in descending order of mutual information and the top k significant features are selected.
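The MIBIF ranking can be sketched with a simple histogram-based estimate of the mutual information between each feature and the class label. The original algorithm uses Parzen-window density estimates [34], so the fixed binning here is a simplifying assumption:

```python
import numpy as np

def mibif_select(F, y, k=9, bins=10):
    """Rank features by a histogram estimate of I(feature; class)
    and return the indices of the top-k features plus all scores."""
    d = F.shape[1]
    classes = np.unique(y)
    scores = np.zeros(d)
    for j in range(d):
        # discretize the j-th feature into equal-width bins
        edges = np.histogram_bin_edges(F[:, j], bins=bins)
        idx = np.digitize(F[:, j], edges[1:-1])   # bin index 0..bins-1
        # I(F_j; class) = sum p(b,c) log2( p(b,c) / (p(b) p(c)) )
        for b in range(bins):
            p_b = np.mean(idx == b)
            if p_b == 0:
                continue
            for c in classes:
                p_bc = np.mean((idx == b) & (y == c))
                p_c = np.mean(y == c)
                if p_bc > 0:
                    scores[j] += p_bc * np.log2(p_bc / (p_b * p_c))
    top = np.argsort(scores)[::-1][:k]
    return top, scores
```

Features are then kept in descending order of score, as in the MIBIF description above.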

PCA
PCA is a useful approach to decorrelate the features and reduce the dimensionality of the feature space [35]. The purpose of PCA is to find the linear orthogonal transformation matrix that maximally preserves the feature variance [36]. The mean feature vector F̄ is first subtracted from each feature vector F_i ∈ R^d, where d denotes the number of features. Then, the covariance matrix C_PCA for F is calculated as follows:

C_PCA = (1/N) ∑_(i=1)^N (F_i − F̄)(F_i − F̄)^T, (11)

where N is the number of samples. The PCA projection matrix W_PCA is obtained by calculating the eigenvectors and eigenvalues of the covariance matrix and selecting the top k eigenvectors in descending order of eigenvalue.
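The PCA projection can be sketched in a few lines: center with the mean feature vector, form the covariance matrix, and keep the top-k eigenvectors. Function and variable names are illustrative:

```python
import numpy as np

def pca_reduce(F, k):
    """Project feature matrix F (n_trials x d) onto the k eigenvectors
    of its covariance matrix with the largest eigenvalues."""
    mu = F.mean(axis=0)                 # mean feature vector
    Fc = F - mu                         # centered features
    C = Fc.T @ Fc / (F.shape[0] - 1)    # covariance matrix C_PCA
    vals, vecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    W = vecs[:, ::-1][:, :k]            # top-k principal directions W_PCA
    return Fc @ W, W, mu
```

At test time, new feature vectors would be centered with the stored `mu` and projected with the same `W`.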

Data and Experiment
In order to better verify the validity of the proposed CiSSA-CSP method, we used two different MI EEG datasets for analysis. The first dataset was the BCI Competition III dataset IVa, which is publicly available and has been used in many studies [34,37]. Therefore, using this dataset, we can effectively compare our method with competing methods. In addition, in order to verify the universal applicability of the method, the second dataset, which was collected by ourselves, was used for analysis and validation.

Public EEG Dataset
BCI Competition III dataset IVa was recorded from five healthy subjects (subjects aa, al, av, aw, and ay). The subjects sat in a comfortable chair and performed motor imagery (right hand and right foot) experiments. The EEG signal was recorded using 118 channels according to the extended international 10-20 system, with 140 trials for each class; thus, a total of 280 trials were provided for each subject. The sampling rate of the EEG data was 100 Hz. Each trial lasted 3.5 s of motor imagery and was followed by a period of 1.75 to 2.25 s in which the subject could relax, as shown in Figure 2b. Seventeen EEG channels were selected in our study, as shown in Figure 2a (FC3, FC1, FCz, FC2, FC4, C5, C3, C1, Cz, C2, C4, C6, CP3, CP1, CPz, CP2, CP4), which cover the sensorimotor area needed to recognize the cue in the experiment [34].

Experimental EEG Dataset
The experiments were approved with a protocol (NO. 20170010) by the Institutional Review Board of Tsinghua University, and written informed consent was obtained from the subjects. Twenty healthy subjects (subjects S1, S2, . . . , S20) aged 20-29 participated in the experiments and abstained from psychoactive substances for at least 4 h prior to the experiments. The experiments were carried out with the subjects sitting on a comfortable chair in a room with normal lighting. The experimental EEG signals were recorded with nine electrodes (F3, Fz, F4, C3, Cz, C4, P3, Pz, P4) from the international 10-20 system, shown in Figure 3a, using the MP160 data acquisition and analysis system (BIOPAC Systems, Inc., Goleta, CA, USA). During each trial, as shown in Figure 3b, the subject relaxed for 3 s, and then a visual cue was presented. Two seconds later, the subject performed the right-hand or right-foot motor imagery task for 5 s. There were 140 trials for each class per subject, i.e., a total of 280 trials for each subject. All EEG signals were recorded at a sampling rate of 250 Hz.


Results and Discussion of Public EEG Dataset
The 17-channel EEG signals of all trials were segmented into T = 4 segments with an overlapping time of 1.5 s (0-2 s, 0.5-2.5 s, 1-3 s, 1.5-3.5 s). Then, every time segment was bandpass filtered into B = 6 non-overlapping sub-bands using the CiSSA method (6-10 Hz, 10-14 Hz, 14-18 Hz, 18-22 Hz, 22-26 Hz, 26-30 Hz). A total of 2M = 4 features were extracted from every sub-band by CSP and, thus, 2MBT = 96 features were obtained. Finally, dimension reduction was conducted by mutual information or PCA, and the optimal features were selected for MI classification. A 10-fold cross-validation was implemented to evaluate the classification performance.

Table 1 shows the classification accuracy of different algorithms for the five subjects using 10-fold cross-validation. The classification performance of standard CSP is not very good, especially for subjects aa, av, and aw. The average classification accuracy of CSP is 81.6%. We refer to classification results obtained with CiSSA-filtered sub-bands before CSP as CiSSA + CSP. The classification results indicate that CiSSA + CSP provides improvements over CSP for all subjects, especially for aa, av, and aw. The average classification accuracy of CiSSA + CSP is 92.3%, which is much higher than that of CSP. This proves that combining spectral information in the CSP features can greatly improve the performance of MI classification. The results corresponding to Subtime + CiSSA + CSP are obtained with time segmentation processing before CiSSA + CSP. With the exception of subject aw, the classification accuracy increases for all subjects when time segmentation is implemented. The average accuracy of Subtime + CiSSA + CSP increases to 94.5%, proving that combining temporal information in the CSP features can further improve the classification accuracy of MI EEG.
The results obtained with MIBIF and PCA processing for dimensionality reduction are referred to as Subtime + CiSSA + CSP + MIBIF and Subtime + CiSSA + CSP + PCA, respectively. Note that k = 9 optimal features are selected for all subjects to preliminarily study the effects of MIBIF and PCA. When MIBIF is used as the feature selection method, the classification accuracy decreases for subjects aa, al, and av, and the average classification accuracy decreases to 93.6%. This indicates that nine optimal features are not enough to carry sufficient discriminative information; further studies should be conducted to select suitable and optimal features. When PCA is used for dimensionality reduction, the classification accuracy increases slightly for all subjects except subject aa. Subtime + CiSSA + CSP + PCA provides the best results, with an average classification accuracy of 96.4%.

In order to understand the effect of the combination of spectral information with CSP features, we visualized the topographical distributions of the broad-band and sub-band EEGs. Figure 4 presents the topographical map and the filter coefficients of the most significant spatial filter learned by the CSP method from the broad band and all sub-bands for subject av. Only the electrodes of the sensorimotor area (inside the red dotted frame) are presented in the topographical map. A larger absolute value of the filter coefficient means more discriminative information [16]. It can be seen that the largest filter coefficient of the most significant spatial filter in the broad frequency band (6-30 Hz) is only 0.44, which leads to poor separability. The largest filter coefficients of the most significant spatial filter in sub-bands 6-10 Hz, 10-14 Hz, 14-18 Hz, 18-22 Hz, 22-26 Hz, and 26-30 Hz are −0.46, 0.43, 0.6, 0.66, 0.59, and 0.77, respectively.
This indicates better separability in Beta rhythm sub-bands (14-18 Hz, 18-22 Hz, 22-26 Hz, and 26-30 Hz) than in Mu rhythm sub-bands (6-10 Hz and 10-14 Hz) and broad band (6-30 Hz) for subject av. More discriminative spectral information is combined in the Beta rhythm sub-bands features. Therefore, we need to find more precise frequency sub-bands for MI CSP feature extraction, since these sub-bands carry the most discriminative information and the remaining sub-bands are irrelevant and redundant to the MI tasks. It is concluded that the combination of spectral information in the CSP features by an effective optimization of filter band is necessary to improve the MI classification performance.
In this study, the sub-bands of the EEG were extracted by CiSSA. In order to compare the sub-band extraction and classification performance with other common filtering methods, the sub-bands were also extracted by FIR filtering with order 60, Butterworth IIR filtering with order 7, and the wavelet decomposition (WDec) method, and the classification accuracies were calculated on CSP features extracted from these sub-bands. Furthermore, the independent component analysis (ICA) method is commonly used in signal decomposition and artifact removal of EEG [38]. The noise and artifacts were removed by the FastICA method based on negentropy [38], and then the sub-bands were extracted by FIR filtering. Table 2 shows the classification accuracies of the CSP, FIR + CSP, IIR + CSP, WDec + CSP, ICA + CSP, ICA + FIR + CSP, and CiSSA + CSP methods on BCI Competition III dataset IVa. It can be seen that, compared with the standard CSP on the broad EEG band, the classification accuracies are highly improved for sub-bands extracted by FIR, IIR, WDec, and CiSSA. This proves that combining spectral information can greatly improve the discrimination of CSP features. Except for subject ay, the classification accuracies obtained by CiSSA are higher than those obtained by FIR, IIR, and WDec for all subjects. The average classification accuracy obtained by CiSSA is improved by 2.3%, 2.9%, and 1.9% over the average classification accuracies obtained by FIR, IIR, and WDec, respectively. This is because CiSSA can suppress noise and artifacts with overlapping frequencies in the sub-bands, while the FIR, IIR, and WDec methods are not able to separate the noise overlapping in the frequency space, which decreases the classification performance of CSP. Figure 5 shows the power spectral density (PSD) of the sub-bands extracted by CiSSA, FIR, IIR, WDec, and ICA + FIR for subject av at electrode C3.
It can be seen that the PSDs of the sub-bands extracted by FIR and IIR are higher than those extracted by CiSSA and ICA + FIR, which can suppress noise and artifacts with overlapping frequencies. The PSDs of the sub-bands extracted by WDec contain components falling outside the frequency width of the sub-bands. Although ICA can also remove noise and artifacts with overlapping frequencies, the average classification accuracy obtained by ICA + FIR is lower than that obtained by CiSSA. It is concluded that CiSSA extracts more precise frequency sub-bands for MI CSP feature extraction.
The bandwidth of the sub-bands is 4 Hz, which has been used in most previous studies [8,10,12,24,39]. Table 3 shows the classification accuracies of the CiSSA + CSP method with different bandwidths on BCI Competition III dataset IVa using 10-fold cross-validation. It can be seen that the classification accuracy attains a high value when the bandwidth is set to 1 or 4 Hz. However, more computing resources and time are needed for a bandwidth of 1 Hz than for a bandwidth of 4 Hz. Therefore, a bandwidth of 4 Hz is the best choice for sub-band extraction.

The Performance of Time Segmentation
To present time window segmentation performance, we made topoplots of spatial filters for subject aa as an example, shown in Figure 6. Figure 6a shows the classification accuracy of the feature space learned by the proposed CiSSA-CSP method using a pictorial representation. Overall, the feature space has five time windows (the whole time window and four sub-time windows) and six frequency sub-bands for each time window. Each time-frequency segment contains four CSP features. It can be observed that CSP feature index 8 (sub-band 10-14 Hz), index 12 (sub-band 14-18 Hz), and index 20 (sub-band 22-26 Hz), which represent the most significant spatial filters learned by the CSP from the sub-bands, attain the best classification accuracy. Features from CSP feature index 12 in all time windows are marked by a red outline. The classification accuracy of sub-time window 0.5-2.5 s is higher than the accuracies of the whole time window of 0-3.5 s and other sub-time windows of 0-2 s, 1-3 s, and 1.5-3.5 s. To further analyze the effect of the proposed method in different time windows, the topographical maps of the most significant spatial filter learned by the CSP from all time windows in sub-band 14-18 Hz (marked by red outline in Figure 6a) are shown in Figure 6b. An evident change in ERD/ERS patterns in the sensorimotor area is observed as the time window changes, which shows that the neural response during motor imagery tasks changes with time. In sub-time windows 0.5-2.5 s, spatial features are more discriminative and significant than in other sub-time windows and the whole time window. Therefore, combining temporal information into CSP features by time segmentation leads to more discriminatory features for MI task classification.
Electrodes inside the red outline in Figure 6b represent the electrodes of the sensorimotor area.
The classification performance is also affected by the time-window length. The classification accuracies at different time-window lengths with a window step of 0.5 s were calculated using 10-fold cross-validation, and the results are shown in Table 4. It can be seen that the classification accuracies vary within a small range of 1.2% and that the accuracy attains its maximum at a time-window length of 2 s.
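The segmentation described above (a whole 0-3.5 s trial plus 2 s sub-windows at a 0.5 s step) can be sketched as follows; the function name and interface are illustrative, not taken from the paper.

```python
def time_windows(n_samples, fs, win_len, step):
    """Start/stop sample indices of the whole trial plus sliding sub-windows.

    win_len and step are in seconds, fs in Hz; e.g. 2 s windows with a
    0.5 s step over a 3.5 s trial reproduce the five time windows above.
    """
    w, s = int(win_len * fs), int(step * fs)
    wins = [(0, n_samples)]          # the whole time window first
    start = 0
    while start + w <= n_samples:    # then each sliding sub-window
        wins.append((start, start + w))
        start += s
    return wins
```

For a 3.5 s trial sampled at 100 Hz this returns the whole window plus the 0-2 s, 0.5-2.5 s, 1-3 s, and 1.5-3.5 s sub-windows used in the analysis above.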

The Effect of Feature Selection by MIBIF
To have an intuitive understanding of the distribution of significant time-frequency segments, the values of MIBIF belonging to each time-frequency segment were calculated, as shown in Figure 7 for subject aa. It can be seen that the highest values are located at feature indexes 8, 12, and 20, which is consistent with Figure 6. It is concluded that features with higher MIBIF values contain more discriminatory information for improving the accuracy of MI EEG classification. In addition, the significance (MIBIF value) changes along the time axis and across frequency bands due to the non-stationarity of EEG, and the most significant features are located in a few local time-frequency segments. Therefore, decomposing a multi-channel EEG into time-frequency segments for more precise analysis helps to improve the classification accuracy. Figure 8 depicts the distributions of the two most significant features derived by CSP, CiSSA + CSP, Subtime + CSP, and Subtime + CiSSA + CSP for subject aa. When the spectral (CiSSA + CSP) or temporal (Subtime + CSP) information is combined into the CSP features, more separable feature distributions are obtained in comparison with the standard CSP. The highest discriminability was achieved by Subtime + CiSSA + CSP, which combines both spectral and temporal information: the classification accuracy of its two most significant features is 85.0%, 10.7% higher than that obtained by the standard CSP.
The classification accuracy of the proposed method varies with the number of features selected by the MIBIF. To select the most suitable features, the classification accuracies for different numbers of features selected by MIBIF were calculated, as shown in Figure 9 for subject av. Figure 9 indicates that the highest classification accuracy (85.7%) is attained when the 25 most significant features are selected for subject av. The highest classification accuracies and the numbers of features selected by MIBIF for all subjects are shown in Table 5. The average highest classification accuracy with feature selection by MIBIF over all subjects is 96.3%, a 1.4% improvement over the average classification accuracy without feature selection.
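The ranking idea behind MIBIF can be sketched with a simplified, stdlib-only estimator: discretize each feature, estimate its mutual information with the class label, and keep the top k. The actual MIBIF algorithm uses a different (kernel-based) estimator; the equal-width binning here is our own assumption for illustration.

```python
import math
from collections import Counter

def mutual_information(feature, labels, n_bins=4):
    """I(X;Y) in bits between a discretized feature column and class labels."""
    lo, hi = min(feature), max(feature)
    # Equal-width binning of the continuous feature (illustrative choice).
    binned = [min(int((x - lo) / (hi - lo + 1e-12) * n_bins), n_bins - 1)
              for x in feature]
    n = len(labels)
    pxy = Counter(zip(binned, labels))
    px, py = Counter(binned), Counter(labels)
    # Sum over observed (x, y) cells of p(x,y) * log2(p(x,y) / (p(x) p(y))).
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def top_k_features(feature_columns, labels, k):
    """Indexes of the k features most informative about the class."""
    scores = [mutual_information(col, labels) for col in feature_columns]
    return sorted(range(len(feature_columns)), key=lambda i: -scores[i])[:k]
```

A feature that perfectly tracks the class attains 1 bit of mutual information for balanced binary labels, while a class-independent feature scores near zero, so the ranking keeps the former.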

The Effect of Dimensionality Reduction by PCA
We note from Table 1 that the classification accuracies increase when the features are dimensionally reduced by PCA for certain subjects (av, aw, and ay). The receiver operating characteristic (ROC) curves for the 57 features selected by MIBIF and the five features selected by PCA for subject aa are given in Figure 10a. The area under the PCA curve is greater than the areas under the curves of the MIBIF-selected features and of all the original features, which means the features selected by PCA are more discriminative than those selected by MIBIF. Figure 10b shows the distribution of the first two features obtained by PCA for subject aa. Note that the right-hand and right-foot imagery classes are nearly linearly separable with the top two PCA features. The classification accuracy of the first two features derived by PCA is 93.6%, higher than that derived by MIBIF (shown in Figure 8).
Similar to MIBIF, the classification accuracy also varies with the feature dimension selected by PCA. Figure 9 indicates that the highest classification accuracy (87.9%) is attained when the 12 most significant features derived by PCA are selected for subject av. The classification accuracy of PCA is higher than that of MIBIF when the same number of top significant features is selected. This is because PCA decorrelates the features and reduces the redundant information between them, whereas MIBIF extracts the features most relevant to the class; features extracted by MIBIF may therefore be highly correlated and contain redundant information. Figure 11 shows the distribution of mutual information between the top 25 features selected by MIBIF and by PCA for subject av. A higher value of mutual information means more relevance between two features. It can be seen that the top features extracted by MIBIF are highly correlated, while the features extracted by PCA are not. The highest classification accuracies and the selected feature dimensions by PCA for all subjects are shown in Table 5. The average highest classification accuracy with dimensionality reduction by PCA over all subjects is 96.6%, 0.3% higher than the accuracy derived by MIBIF.
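The decorrelation property discussed above follows directly from how PCA is computed: projecting onto eigenvectors of the feature covariance matrix yields components with a diagonal covariance. A minimal sketch (our own helper, not the paper's code):

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project (n_trials, n_features) data onto the top principal components.

    Returns the reduced scores and the component matrix. The scores are
    uncorrelated by construction, unlike MIBIF-selected features.
    """
    x = features - features.mean(axis=0)
    # Eigendecomposition of the sample covariance matrix.
    cov = x.T @ x / (len(x) - 1)
    evals, evecs = np.linalg.eigh(cov)          # ascending eigenvalues
    comps = evecs[:, ::-1][:, :n_components]    # largest-variance directions first
    return x @ comps, comps
```

Because the components are orthogonal eigenvectors of the covariance, the off-diagonal covariance of the reduced scores is zero up to floating-point error, which is exactly the redundancy reduction contrasted with MIBIF in the text.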

Comparison with Other Competing Techniques
Since accuracy is the key criterion for evaluating the performance of methods in a BCI system, we compared the classification accuracy of the proposed CiSSA-CSP method with other competing methods. Table 6 provides a comparative study of the classification performance between the proposed method and ten recently reported methods for Competition III dataset IVa, namely, FBCSP [14], CTFSP [6], Fusion [18], TWFBCSP-MVO [24], SFBCSP [16], STFSCSP [39], DFBCSP [40], CC-LR [37], ISSPL [41], and Class Separability (CS) [35] methods. The highest classification accuracies among these methods are highlighted in bold font for each subject and their averages. The highest classification accuracy of our proposed method is 100% for subject aw. Furthermore, the classification accuracies of our proposed method for subjects aa, al, and ay are very close to the highest classification accuracy of other competing methods. The average classification accuracies of our proposed method are 96.3% and 96.6% for MIBIF and PCA feature selection, respectively, which are higher than the average classification accuracies of other methods. It can be concluded that the proposed method outperforms the recently reported competing methods for MI EEG classification.

Computational Complexity
In order to investigate the computational complexity of the proposed method, we measured the time consumption of the training and testing phases on Competition III dataset IVa. The experiment was implemented using MATLAB R2014a on a PC with an Intel(R) Core(TM) 2.40 GHz CPU and 8.0 GB RAM. Figure 12 shows the computational time of the training phase taken by the different methods with 10-fold cross-validation. From Figure 12a, it can be seen that combining spectral and temporal information in the CSP features by sub-band filtering (CiSSA + CSP) and time segmentation (Subtime + CiSSA + CSP) takes much more time than the CSP method. In addition, the time required by the MIBIF feature selection method is much longer than that of PCA. Furthermore, we compared the time consumption of CiSSA and other common filtering methods, as shown in Figure 12b. The results indicate that CiSSA and FIR consume the least time, while CiSSA achieves the highest classification accuracy (shown in Table 2). After training, the optimal CSP filter for each time-frequency segment, the indexes of the selected features, and the SVM model can be used directly for testing; hence, the computational time is significantly reduced during the testing phase. Table 7 lists the average testing time of one trial using our method and other recently reported methods for subject aa. The results indicate that, for one test trial, the average execution time of our method is 156.4 ms (MIBIF) or 156.7 ms (PCA). Although our method takes longer to compute one trial than the other competing methods, it can meet the requirement of real-time processing since the computational time is much less than the length of one trial (3.5 s). Therefore, the proposed method improves the motor imagery classification performance without degrading the computational efficiency for BCI applications.
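A minimal way to check that per-trial testing time stays within the real-time budget (here the 3.5 s trial length) is a small timing harness; `time_call` and the closure passed to it are illustrative helpers, not the authors' benchmark code.

```python
import time

def time_call(fn, *args, repeats=10):
    """Average wall-clock time of fn(*args) over several repeats, in ms."""
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - t0) / repeats * 1e3

# Usage sketch: classify_one_trial is a hypothetical stand-in for the
# trained pipeline (CSP filtering, feature selection, SVM prediction).
# elapsed_ms = time_call(classify_one_trial, trial)
# assert elapsed_ms < 3500   # must finish within one 3.5 s trial
```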

Results and Discussion of Experimental EEG Dataset
In the study of the experimental EEG dataset, the 9-channel EEG signals of all trials were segmented into T = 4 epochs with an overlap of 1 s (0-2 s, 1-3 s, 2-4 s, 3-5 s). Table 8 shows the classification accuracies of the different algorithms for the twenty subjects using 10-fold cross-validation. The classification performance of CSP is poor for most subjects, with an average accuracy of 74.7%. When spectral information is combined with the CSP features by decomposing the EEG into sub-bands using CiSSA, the classification performance improves over CSP for all subjects; the average classification accuracy of CiSSA + CSP is 90.4%, and the average accuracy of Subtime + CiSSA + CSP further increases to 92.3%. As with the publicly available dataset, k = 9 optimal features were selected for all subjects to preliminarily study the effects of MIBIF and PCA. When MIBIF is used as the feature selection method after Subtime + CiSSA + CSP, the classification accuracy decreases for most subjects and the average classification accuracy drops to 89.8%, indicating that nine optimal features do not carry sufficient discriminative information. When PCA is used for dimensionality reduction after Subtime + CiSSA + CSP, the classification accuracy increases slightly for all subjects except S1, S12, and S14. Subtime + CiSSA + CSP + PCA provides the best results, with an average classification accuracy of 93.9%. The results of the experimental EEG dataset are consistent with those of Competition III dataset IVa. It is concluded that the proposed CiSSA-CSP method can be used on different MI datasets, which verifies its broad applicability. To verify the reliability of the experimental results, a paired t-test [27] was applied between each pair of adjacent methods in Tables 1 and 8 to assess the statistical difference in the classification accuracies of the different methods.
The paired t-test results over all subjects of the public and experimental datasets are shown in Table 9. All p-values are less than 0.05 (p < 0.05), which means that all improvements are statistically significant. In Table 9, the paired t-test is applied between two adjacent methods; for example, 0.0000 is the p-value between the CSP and CiSSA + CSP methods, and 0.0018 is the p-value between the Subtime + CiSSA + CSP and CiSSA + CSP methods.
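The paired t-test statistic underlying Table 9 can be computed with the standard library alone, as a minimal sketch (the accuracy lists in the usage are invented, not the paper's data); the p-value is then read off the t distribution with n - 1 degrees of freedom, e.g. via `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for a paired t-test.

    a and b are per-subject accuracies of two methods, paired by subject:
    t = mean(d) / (stdev(d) / sqrt(n)) where d is the per-pair difference.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n)), n - 1
```

For five subjects, the two-sided 5% critical value is about 2.776 (df = 4), so any larger |t| corresponds to p < 0.05, matching the significance criterion used above.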
The classification accuracy of the proposed method on the experimental EEG dataset also varies with the number of features selected by MIBIF or PCA. To select the most suitable features, the classification accuracies over the number of selected features were calculated, and the number of features yielding the highest accuracy was chosen. Table 10 shows the highest classification accuracies and the selected feature dimensions for MIBIF and PCA. For all subjects except S11, the number of features selected by PCA is smaller than that selected by MIBIF, while the classification accuracy derived by PCA is higher. The average highest classification accuracy with dimensionality reduction by PCA over all subjects is 95.2%, 1.5% higher than the accuracy derived by MIBIF.

Conclusions
We propose a novel algorithm, CiSSA-CSP, for learning the optimal time-frequency-spatial patterns to improve the classification accuracy of MI EEG. Specifically, raw EEG data are first segmented into multiple time segments using a sliding window, and spectrum-specific sub-bands are further derived for each time segment in a set of non-overlapping filter bands using CiSSA. Features extracted from all time-frequency segments using CSP therefore combine richer and more discriminative time-frequency-spatial information. We then devised feature selection based on mutual information or PCA to extract robust and optimal CSP features, and a linear SVM classifier was trained on the optimized EEG features to accurately identify the MI tasks. The experimental study on the public and experimental EEG datasets validated the effectiveness of the CiSSA-CSP method. Compared with several other competing methods, the proposed CiSSA-CSP method achieves superior classification accuracy (average accuracies of 96.6% and 95.2% on the public and experimental datasets, respectively), which confirms that it is a promising method for improving the performance of MI-based BCIs.