The CSP-Based New Features Plus Non-Convex Log Sparse Feature Selection for Motor Imagery EEG Classification

The common spatial pattern (CSP) is a very effective feature extraction method for motor imagery based brain-computer interfaces (BCI), but its performance depends on the selection of the optimal frequency band. Although many works have been proposed to improve CSP, most of them suffer from high computational cost and long feature extraction time. To this end, three new feature extraction methods based on CSP and a new feature selection method based on non-convex log regularization are proposed in this paper. Firstly, the EEG signals are spatially filtered by CSP, and three new feature extraction methods are proposed, which we call CSP-Wavelet, CSP-WPD, and CSP-FB, respectively. For CSP-Wavelet and CSP-WPD, the discrete wavelet transform (DWT) or wavelet packet decomposition (WPD) is used to decompose the spatially filtered signals, and the energy and standard deviation of the wavelet coefficients are extracted as features. For CSP-FB, the spatially filtered signals are filtered into multiple bands by a filter bank (FB), and the logarithm of the variances of each band is extracted as a feature. Secondly, a sparse optimization method regularized with a non-convex log function, which we call LOG, is proposed for feature selection, and an optimization algorithm for LOG is given. Finally, ensemble learning is used for secondary feature selection and classification model construction. Combining the feature extraction and feature selection methods, a total of three new EEG decoding methods are obtained, namely CSP-Wavelet+LOG, CSP-WPD+LOG, and CSP-FB+LOG. Four public motor imagery datasets are used to verify the performance of the proposed methods. Compared to existing methods, the proposed methods achieved the highest average classification accuracies of 88.86%, 83.40%, 81.53%, and 80.83% on datasets 1-4, respectively. The feature extraction time of CSP-FB is the shortest.
The experimental results show that the proposed methods can effectively improve the classification accuracy and reduce the feature extraction time. Considering both classification accuracy and feature extraction time, CSP-FB+LOG performs best and is suitable for real-time BCI systems.

are convex optimization models. Although they have achieved good results, many applications have shown that non-convex sparse optimization methods can obtain better performance [49]. For example, LASSO has a bias problem, which would result in significantly biased estimates, and cannot achieve reliable recovery with the least observations [50].
To address the high computational cost and long feature extraction time of the Wavelet-CSP [13,14], WPD-CSP [15,16], and FBCSP [11] methods, we propose three new feature extraction methods, namely CSP-Wavelet, CSP-WPD, and CSP-FB. Firstly, the original EEG signals are pre-processed, including time window selection and band-pass filtering. Then, the CSP transform is performed. For CSP-Wavelet, the discrete wavelet transform (DWT) is used to decompose the spatially filtered signals, and the energy and standard deviation of the wavelet coefficients are extracted as features. For CSP-WPD, wavelet packet decomposition (WPD) is used to decompose the spatially filtered signals; similar to CSP-Wavelet, the energy and standard deviation of the wavelet coefficients are extracted as features. For CSP-FB, the spatially filtered signals are filtered into multiple frequency bands by a filter bank (FB), and the logarithm of the variances of each band is extracted as a feature. To address the bias problem of LASSO, a new feature selection method is proposed, in which a non-convex function is used to sparsely constrain the feature weights. Since the non-convex function is a log function, we call this method LOG. In addition, to further optimize feature selection and enhance the robustness of the classification model, an ensemble learning method is proposed for secondary feature selection and the construction of multiple classification models. Fisher linear discriminant analysis (FLDA) is used for classification. Combining feature extraction with feature selection, we obtain three EEG decoding methods, namely CSP-Wavelet+LOG, CSP-WPD+LOG, and CSP-FB+LOG. Experimental results show that the classification performance of the three newly proposed methods is better than that of the CSP, Wavelet-CSP, WPD-CSP, SFBCSP, and SBLFB methods, and their feature extraction time is much shorter than that of Wavelet-CSP, WPD-CSP, SFBCSP, and SBLFB.
The main contributions of this paper include three aspects. Firstly, we propose three new feature extraction methods based on CSP. These three methods can effectively improve the classification performance of CSP while reducing the feature extraction time. Secondly, we propose a new feature selection method. This method is a non-convex sparse optimization method, which can effectively address the bias problem of LASSO and select more discriminative features. Thirdly, we use ensemble learning for secondary feature selection and classification model construction, which makes the EEG decoding method more robust and stable.
The content of this paper is organized as follows. Section 2 introduces the experimental data, the traditional CSP feature extraction method, the three new feature extraction methods, the new feature selection method, and the secondary feature selection and classification model construction using ensemble learning. The experimental results are shown in Section 3. Section 4 further discusses and analyzes the experimental results. The conclusion is provided in Section 5.

EEG Data Description
Four public motor imagery EEG datasets are briefly described as follows. For detailed information, please refer to the related literature or websites.
Dataset 1: data set I of BCI competition IV (2008) [51]. This dataset has a total of 59 channels with a sampling rate of 100 Hz. There are three types of motor imagery tasks: left hand, right hand, and right foot; each of the seven subjects (1a, 1b, 1c, 1d, 1e, 1f, 1g) selected two of them to perform. In this paper, the calibration data of this dataset are used for classification, which include two runs with 100 single trials each. The first run was selected as the training set and the second run as the test set. For detailed information, please refer to the following website: http://www.bbci.de/competition/IV/.
Dataset 2: data set IIa of BCI competition IV (2008) [52]. This dataset has a total of 22 channels with a sampling rate of 250 Hz. Nine subjects (A01, A02, ..., and A09) performed four types of motor imagery tasks, including left hand, right hand, foot, and tongue. There are two sessions in this dataset, and each session consisted of 6 runs with 48 trials (12 trials for each class) per run. In this paper, the first session was selected as the training set and the second session as the test set. The training and test sets of each subject were 72 trials. Following the practice of reference [53], the four types of tasks are arranged and combined to obtain multiple binary classification problems, that is, C(4,2) = 6 groups of binary classification. Therefore, a total of 9 × 6 = 54 data subsets are obtained. The left hand, right hand, foot, and tongue motor imagery tasks are represented by the letters L, R, F, and T, respectively; A01T-LR indicates that subject A01T performed the left hand and right hand motor imagery tasks. For additional information, please refer to the following website: http://www.bbci.de/competition/IV/.
Dataset 3: data set IIb of BCI competition IV (2008) [24]. This dataset has a total of 3 channels with a sampling rate of 250 Hz. Nine subjects (B01, B02, ..., and B09) performed two types of motor imagery tasks, including left hand and right hand. There are five sessions in this dataset; however, only the third training session (B0103T, B0203T, ..., B0903T) is used in this paper [24]. This session consists of 160 trials, half for each class; 80 trials are used for the training set and the other 80 for the test set. For additional information, please refer to the following website: http://www.bbci.de/competition/IV/.
Dataset 4: data set provided by David Steyrl (2016) [54]. This dataset has a total of 15 channels with a sampling rate of 512 Hz. Fourteen subjects performed two types of motor imagery tasks, including right hand and foot. The data of each subject were divided into two parts: the first part (runs 1-5) was used for the training set, whereas the second part (runs 6-8) was used for the test set, and each run consisted of 20 trials (10 trials for each class). Therefore, the training and test sets are 100 and 60 trials, respectively. The original signals are downsampled to a sampling rate of 256 Hz. For more information, please refer to the following website: http://bnci-horizon-2020.eu/database/data-sets.
All datasets are scalp EEG signals, which are recorded by multiple electrode sensors placed on the scalp. Figure 1 shows the distribution of electrodes on the scalp for the four datasets. We focus on the signal processing and pattern recognition of electrode sensor signals in this paper.
Figure 2 is a flowchart of the overall processing of the proposed method. It mainly includes preprocessing, CSP transformation, feature extraction, feature selection, and classification. Each part will be discussed in detail in the following content.

Data Preprocessing
(1) A 6th-order Butterworth filter is used to perform 8-30 Hz band-pass filtering on the EEG signals of each channel, which removes the EEG components that are not related to motor imagery. Butterworth filters are commonly used for EEG band-pass filtering [53], and we follow the practice of most of the literature. Motor imagery causes ERS and ERD phenomena, that is, power changes in specific frequency bands of the EEG signals, specifically the mu (8-12 Hz) and beta (18-26 Hz) rhythms [2]. Therefore, an 8-30 Hz band-pass filter is usually used to filter motor imagery signals [55]. (2) Extracting single-trial data. The time window of dataset 1 is 0.5-3.5 s, and that of the other datasets is 0.5-2.5 s, where 0 s is the time when the motor imagery task starts. The time window of datasets 2-4 differs from that of dataset 1 because the sampling rate of datasets 2-4 is relatively high; choosing the 0.5-2.5 s window reduces the amount of data and thus the amount of computation.
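The two preprocessing steps above can be sketched in a few lines of SciPy. This is a minimal illustration, not the paper's exact code; the function name and defaults are ours, and note that SciPy's `butter` doubles the effective order for band-pass designs.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(eeg, fs, band=(8.0, 30.0), window=(0.5, 2.5)):
    """Band-pass filter each channel, then cut out the motor-imagery window.

    eeg    : (channels, samples) array for one trial, with t = 0 at cue onset
    fs     : sampling rate in Hz
    band   : pass band in Hz (8-30 Hz, as in the paper)
    window : time window in seconds relative to task onset
    """
    # Butterworth band-pass (order 6 per edge; zero-phase via filtfilt)
    b, a = butter(6, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, eeg, axis=1)
    start, stop = int(window[0] * fs), int(window[1] * fs)
    return filtered[:, start:stop]

# toy usage: 3 channels, 4 s at 250 Hz
trial = np.random.randn(3, 1000)
epoch = preprocess(trial, fs=250)
print(epoch.shape)  # (3, 500)
```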

CSP Transformation
For the binary classification problem, CSP looks for a set of spatial filters that maximize the variance of the band-pass filtered EEG signals in one class while minimizing it in the other. The spatial filter w is calculated by simultaneous diagonalization of the sample covariance matrices of the two classes, as follows:

w = argmax_w (w^T C̄_1 w) / (w^T C̄_2 w),  (1)

where T denotes the transpose, and C̄_1 and C̄_2 represent the average covariance matrices of the two types of tasks, respectively, which are defined as follows:

C̄_k = (1/N_k) Σ_{n=1}^{N_k} D^(k,n) (D^(k,n))^T / trace(D^(k,n) (D^(k,n))^T), k = 1, 2,  (2)

where trace(·) denotes the matrix trace, N_k represents the number of samples of the kth task (that is, the number of single trials), and D^(k,n) ∈ R^(C×K) represents the nth trial of the kth task, where C is the total number of EEG channels and K is the number of samples per channel. Formula (1) can be transformed into the following generalized eigenvalue problem [55]:

C̄_1 w = λ C̄_2 w.  (3)
The spatial filters are then the eigenvectors of M = C̄_2^(−1) C̄_1. The columns of M are arranged in descending order of eigenvalues to obtain M̃, and the eigenvectors in the first m columns and the last m columns of M̃ are taken as the final spatial filter, denoted W. In all experiments in this paper, m is set to 3. For single-trial data D, its spatial projection signal is:

Z = W^T D.  (4)

The traditional CSP feature extraction method extracts the logarithm of the variances of the spatially filtered signals as features, as follows:

f_i = log(var(z_i)), i = 1, 2, ..., 2m,  (5)

where var(·) denotes the variance and z_i is the ith row of Z. Finally, the feature vector of the single-trial data is obtained by evaluating formula (5), that is, x = [f_1, f_2, ..., f_2m].
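The CSP pipeline of formulas (1)-(5) can be sketched in NumPy/SciPy as below. For numerical stability this sketch solves the symmetric generalized eigenvalue problem `eigh(C1, C1 + C2)` rather than explicitly forming C̄_2^(−1) C̄_1; the resulting eigenvectors are the same. All names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_1, trials_2, m=3):
    """trials_k: list of (C, K) arrays for class k. Returns W of shape (2m, C)."""
    def avg_cov(trials):
        # trace-normalized covariance, averaged over trials (formula (2))
        return np.mean([D @ D.T / np.trace(D @ D.T) for D in trials], axis=0)
    C1, C2 = avg_cov(trials_1), avg_cov(trials_2)
    # symmetric form of the generalized eigenvalue problem in formula (3)
    vals, vecs = eigh(C1, C1 + C2)
    vecs = vecs[:, np.argsort(vals)[::-1]]          # descending eigenvalues
    return np.hstack([vecs[:, :m], vecs[:, -m:]]).T  # first m and last m columns

def csp_features(W, D):
    Z = W @ D                         # spatial projection (formula (4))
    return np.log(np.var(Z, axis=1))  # log-variance features (formula (5))

# toy usage: two classes with different per-channel variance profiles
rng = np.random.default_rng(0)
t1 = [rng.normal(size=(8, 200)) for _ in range(20)]
t2 = [rng.normal(size=(8, 200)) * np.linspace(0.5, 2, 8)[:, None] for _ in range(20)]
W = csp_filters(t1, t2)
x = csp_features(W, t1[0])
```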

CSP-Wavelet
After spatial filtering, we obtain Z as in Section 2.4.1. The discrete wavelet transform is performed on each channel of Z. The derivation of the wavelet decomposition formula is described in detail in [14]; interested readers can refer to that literature. After the wavelet decomposition, the frequency sub-bands related to motor imagery are selected, and the energy and standard deviation of the wavelet coefficients of the selected sub-bands are extracted as features. The wavelet base is db4. To select sub-bands related to motor imagery, the number of wavelet decomposition layers must be chosen according to the sampling rate of the dataset. For dataset 1 (sampling rate 100 Hz), the number of decomposition layers is 3 in this paper. For datasets 2-4 (sampling rates of 250 and 256 Hz), the number of decomposition layers is 4. The selection of the number of decomposition layers will be discussed in detail in the discussion section. Figure 3 shows the process of wavelet decomposition at different sampling rates. The sampling rates of 256 Hz and 250 Hz are very close, so we only consider the decomposition process for 250 Hz; the selected sub-bands for 256 Hz and 250 Hz are the same. Wavelet coefficients whose sub-band frequencies lie in the range of 8-30 Hz are selected and used for feature extraction. The selected sub-bands are marked by the red dotted frame in Figure 3.
The energy of the wavelet coefficients of the selected sub-bands is calculated as follows:

e_i = Σ_{j=1}^{N} D_ij^2, i = 1, 2, ..., B,  (6)

where B represents the number of selected sub-bands, N represents the number of wavelet coefficients, and D_ij represents the jth wavelet coefficient of the ith sub-band. The standard deviation of the wavelet coefficients of the selected sub-bands is calculated as follows:

s_i = sqrt( (1/N) Σ_{j=1}^{N} (D_ij − μ_i)^2 ), i = 1, 2, ..., B,  (7)

where the meaning of B, N, and D_ij is consistent with formula (6) and μ_i represents the average value of the wavelet coefficients of the ith sub-band. Finally, we can obtain the feature vector of the CSP-Wavelet feature extraction method as follows:

x = [e_1^1, s_1^1, e_2^1, s_2^1, ..., e_B^1, s_B^1; ...; e_1^2m, s_1^2m, e_2^2m, s_2^2m, ..., e_B^2m, s_B^2m],  (8)

where e_i^c and s_i^c represent the energy and standard deviation of the ith sub-band of the cth channel of Z, respectively.
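The energy and standard deviation features of formulas (6)-(8) can be sketched with the PyWavelets package as below. The `keep` indices are illustrative placeholders, not the paper's exact choice: which coefficient arrays correspond to 8-30 Hz depends on the sampling rate and decomposition depth, as Figure 3 shows.

```python
import numpy as np
import pywt

def wavelet_features(Z, wavelet="db4", level=3, keep=(1, 2)):
    """Energy and standard deviation of selected DWT sub-bands.

    Z    : (2m, K) spatially filtered signals
    keep : indices into the wavedec coefficient list [cA_L, cD_L, ..., cD_1]
           treated as the motor-imagery sub-bands (illustrative choice)
    """
    feats = []
    for z in Z:
        coeffs = pywt.wavedec(z, wavelet, level=level)
        for i in keep:
            d = coeffs[i]
            feats.append(np.sum(d ** 2))  # energy, formula (6)
            feats.append(np.std(d))       # standard deviation, formula (7)
    return np.array(feats)               # interleaved as in formula (8)
```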

CSP-WPD
Similar to CSP-Wavelet, Wavelet packet decomposition is performed on each channel of Z. The derivation of the wavelet packet decomposition formula is described in detail in [56], interested readers can refer to literature [56]. The energy and standard deviation of the wavelet coefficients of the selected sub-bands are extracted as features. The wavelet base and the number of decomposition layers are the same as CSP-Wavelet. Figure 4 shows the process of wavelet packet decomposition with different sampling rates. Wavelet coefficients with sub-band frequencies in the range of 8-30 Hz are selected and used for feature extraction. Similar to CSP-Wavelet, the selected sub-bands of 256 Hz and 250 Hz are the same. The selected sub-bands are marked by the red dotted frame in Figure 4.

The calculation of the energy and standard deviation is the same as for CSP-Wavelet, so the final feature vector has a similar form:

x = [e_1^1, s_1^1, ..., e_B′^1, s_B′^1; ...; e_1^2m, s_1^2m, ..., e_B′^2m, s_B′^2m],  (9)

where B′ represents the number of selected sub-bands of CSP-WPD.
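A CSP-WPD sketch with PyWavelets' `WaveletPacket` is given below. The node-to-frequency mapping used here (node k at the chosen level covers roughly [k, k+1)·fs/2^(level+1) Hz in frequency order) is a simplification for illustration; sub-bands overlapping 8-30 Hz are kept.

```python
import numpy as np
import pywt

def wpd_features(Z, wavelet="db4", level=4, band=(8.0, 30.0), fs=250.0):
    """Energy and std of WPD terminal nodes overlapping the 8-30 Hz band."""
    width = fs / 2 ** (level + 1)  # approximate bandwidth of one terminal node
    feats = []
    for z in Z:
        wp = pywt.WaveletPacket(z, wavelet, maxlevel=level)
        for k, node in enumerate(wp.get_level(level, order="freq")):
            lo, hi = k * width, (k + 1) * width
            if hi > band[0] and lo < band[1]:   # node overlaps 8-30 Hz
                d = np.asarray(node.data)
                feats.append(np.sum(d ** 2))    # energy
                feats.append(np.std(d))         # standard deviation
    return np.array(feats)
```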

CSP-FB
After spatial filtering, the signals of each channel of Z are filtered into 10 sub-bands with a bandwidth of 4 Hz and an overlap of 2 Hz in the range of 8-30 Hz. A 6th-order Butterworth filter is used. Then, the logarithm of the variances of each sub-band is extracted as a feature, so the final feature vector is

x = [f_1^1, f_2^1, ..., f_B^1; ...; f_1^2m, f_2^2m, ..., f_B^2m],  (10)

where B represents the number of filter bands and f_i^c is the log-variance of the ith sub-band of the cth channel of Z.
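A minimal filter-bank sketch follows, assuming SciPy Butterworth band-pass filters and the band layout stated above (8-12, 10-14, ..., 26-30 Hz); names are illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def fb_features(Z, fs, band=(8, 30), width=4, step=2):
    """Log-variance features from overlapping sub-bands of spatially filtered Z."""
    feats = []
    lows = np.arange(band[0], band[1] - width + 1, step)  # 8, 10, ..., 26: 10 bands
    for lo in lows:
        b, a = butter(6, [lo / (fs / 2), (lo + width) / (fs / 2)], btype="band")
        sub = filtfilt(b, a, Z, axis=1)
        feats.append(np.log(np.var(sub, axis=1)))  # formula (10), one band
    return np.concatenate(feats)

# toy usage: 2m = 6 spatially filtered channels at 250 Hz
Z = np.random.randn(6, 500)
x = fb_features(Z, fs=250)
print(x.shape)  # (60,) = 10 bands x 6 channels
```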

Feature Selection
After feature extraction, we obtain a sample feature matrix X = [x_1, x_2, ..., x_N]^T, where x_i represents the ith feature sample (feature vector) and N is the number of samples; the feature dimension depends on the feature extraction method used. The feature vector obtained by the feature extraction method usually contains redundant information. Redundant features not only increase the complexity of the classification model and the model training time, but also easily lead to overfitting. Therefore, feature selection is required to remove redundant features and improve the classification accuracy. LASSO [57] is often used for feature selection, and its mathematical model is as follows:

min_w (1/2) ||y − Xw||_2^2 + λ ||w||_1,  (11)

where λ > 0 is the regularization parameter, w is the feature weight vector, ||w||_1 = Σ_i |w_i|, w_i is the ith element of w, and y = (y_1, y_2, ..., y_N)^T are the sample labels with y_i ∈ {−1, 1}. Although LASSO is widely used and works well, the features selected by LASSO are usually too sparse and scattered throughout the feature space. When the feature dimension is much larger than the sample size, which is very common in EEG signal decoding, the selection results are unstable [58]. In addition, LASSO has a bias problem, which results in significantly biased estimates and cannot achieve reliable recovery with the fewest observations [59]. To ameliorate the bias problem of LASSO, we propose a non-convex sparse optimization method for feature selection. The mathematical model can be described as follows:

min_w (1/2) ||y − Xw||_2^2 + λ Σ_i log(1 + |w_i| / a),  (12)

where a > 0 is a scale parameter, set to 0.001 in this paper. This concave LOG function encourages sparsity more strongly than the l_1-norm and penalizes elements non-uniformly [60]. To solve the minimization problem (12), many efficient algorithms have been proposed, such as proximal algorithms [61] and the alternating direction method of multipliers [62]. Proximal splitting algorithms, including iterative shrinkage thresholding [63], are popular methods for solving (11) and (12).
Proximal gradient methods have several advantages over other methods. They can handle general conditions, including functions that are non-convex, or non-smooth and convex. These algorithms have simple forms and are easy to derive and implement; in particular, they can be applied to large-scale problems. Therefore, in this paper we use iterative log thresholding [64] to solve (12). It has only two basic steps, which are iterated until convergence:
(i) Gradient step. Define an intermediate point v^t at the tth step by taking a gradient step with respect to the differentiable term:

v^t = w^t − (1/γ) X^T (X w^t − y),  (13)

where γ = ||X^T X||_2.
(ii) Proximal operator step. Evaluate the proximal operator of the non-convex log function at the intermediate point:

w^(t+1) = prox_{λ/γ, log}(v^t),  (14)

where prox_{λ/γ, log}(·) is the proximal operator of the log regularization function, which is defined as

prox_{λ/γ, log}(v) = argmin_w { (1/2) ||w − v||_2^2 + (λ/γ) Σ_i log(1 + |w_i| / a) }.  (15)

From [65], we know that (15) has an explicit solution. Therefore, w^(t+1) is given component-wise by the log function's proximal operator:

w_i^(t+1) = sign(v_i^t)/2 · ( |v_i^t| − a + sqrt( (|v_i^t| + a)^2 − 4λ/γ ) ) if (|v_i^t| + a)^2 ≥ 4λ/γ, and 0 otherwise,  (16)

where sign(v_i^t) is the algebraic sign of v_i^t. Hence, the detailed iteration for solving (12) can be expressed as

w^(t+1) = prox_{λ/γ, log}( w^t − (1/γ) X^T (X w^t − y) ).  (17)

We can see that the iteration scheme (17) is easy to implement and involves only matrix-vector multiplication; every step has a closed-form solution, which makes it suitable for large-scale problems. Finally, the convergence of (17) is established in [64], and we refer interested readers to [64] for more details.
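The two-step iteration (13) and (16) can be sketched in NumPy as below. This is a simplified illustration: a rigorous proximal operator would also compare the objective value at the non-zero root against zero before accepting it, which is omitted here, and the iteration count is fixed rather than tested for convergence.

```python
import numpy as np

def prox_log(v, t, a=0.001):
    """Component-wise proximal operator of w -> t * log(1 + |w|/a), formula (16).

    Components whose discriminant is negative are set to 0 (thresholding);
    the objective comparison against w = 0 is omitted for brevity.
    """
    disc = (np.abs(v) + a) ** 2 - 4 * t
    root = np.where(disc > 0,
                    (np.abs(v) - a + np.sqrt(np.maximum(disc, 0.0))) / 2,
                    0.0)
    return np.sign(v) * np.maximum(root, 0.0)

def log_sparse_fit(X, y, lam=0.1, a=0.001, iters=1000):
    """Proximal-gradient sketch for min_w 0.5||y - Xw||^2 + lam*sum(log(1+|w_i|/a))."""
    gamma = np.linalg.norm(X.T @ X, 2)        # Lipschitz constant, formula (13)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        v = w - X.T @ (X @ w - y) / gamma     # gradient step (13)
        w = prox_log(v, lam / gamma, a)       # log-thresholding step (16)
    return w
```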

Secondary Feature Selection and Classification Model Construction
In order to select more effective features and construct a more robust classification model, we propose an ensemble learning method for secondary feature selection and the construction of multiple classification models. The overall processing flow is shown in Figure 5, where |w| represents the absolute value of w. In Section 2.5, we obtain feature weights after feature selection is performed by the LASSO or LOG method. We further select features by setting a series of weight thresholds. The candidate threshold parameters are [0, 0.1, ..., 0.8]; the features whose weight is greater than the set threshold are selected.
During the training phase, different thresholds produce different feature subsets, and each feature subset is used to train a different classification model. In this paper, we use FLDA as the classifier, so we obtain multiple FLDA classification models. During the test phase, we use the rules from the training phase to select the feature subsets, and then use the classification models obtained in the training phase for classification. Because there are multiple classification models, we obtain multiple classification accuracies, and we take the maximum of these as the final classification accuracy. It is worth noting that if a feature subset is empty, its classification accuracy is directly set to 0.
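The threshold-ensemble described above can be sketched as follows. A small two-class Fisher LDA is implemented directly in NumPy (the paper does not specify an implementation); all names and the regularization term on the scatter matrix are illustrative.

```python
import numpy as np

def flda_fit(X, y):
    """Two-class Fisher LDA: w = Sw^{-1}(mu1 - mu0), bias at the projected midpoint."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), mu1 - mu0)  # ridge for stability
    b = -w @ (mu0 + mu1) / 2
    return w, b

def ensemble_accuracy(weights, Xtr, ytr, Xte, yte,
                      thresholds=np.arange(0.0, 0.9, 0.1)):
    """Train one FLDA model per |weight| threshold; report the best test accuracy.

    An empty feature subset scores 0, as in the paper.
    """
    accs = []
    for th in thresholds:
        idx = np.abs(weights) > th
        if not idx.any():
            accs.append(0.0)
            continue
        w, b = flda_fit(Xtr[:, idx], ytr)
        pred = (Xte[:, idx] @ w + b > 0).astype(int)
        accs.append(np.mean(pred == yte))
    return max(accs)
```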


Traditional machine learning methods use cross-validation to select the optimal feature subset and classification model in the training phase, and then use the obtained classification model for classification in the testing phase. However, our method trains multiple models and selects the maximum accuracy among them as the final accuracy. This is where our work differs from previous research. EEG signals have strong randomness and non-stationarity, and are easily affected by the surrounding environment and noise during collection. The optimal feature subset and classification model selected in the training phase may not be optimal on the test set when the data are disturbed by noise. Different data samples suffer different interference, so at classification time they may obtain the best results with different feature subsets and classification models. Choosing the maximum value over multiple classification models as the final accuracy provides a certain degree of interference resistance and increases the stability and robustness of the EEG decoding model.

Compared Methods and Parameter Settings
In this paper, we use classification accuracy as the evaluation criterion. The classification accuracy is equal to the number of corrected classifications divided by the total number of test sets. FLDA is used for classification. For all methods, except for SFBCSP and SBLFB methods, the original EEG signals are filtered by 8-30 Hz band-pass filters. The compared methods are listed in Table 1, and the parameter setting will be introduced below.

Algorithm Composition and Processing Flow
CSP

Band-pass filtered EEG signals are spatially filtered by CSP. The logarithm of the variances of the spatially filtered signals is extracted as features [10].

Wavelet-CSP
The EEG signals of each channel are decomposed using DWT, the wavelet base is db4. The number of decomposition layers of dataset 1 is 3, and the other datasets are 4. The sub-bands related to motor imagery are used to reconstruct new channels, and then feature extraction is performed using CSP [13].

WPD-CSP
The EEG signals of each channel are decomposed using WPD, the wavelet base is db4. The number of decomposition layers of dataset 1 is 3, and the other datasets are 4. The sub-bands related to motor imagery are used to reconstruct new channels, and then feature extraction is performed using CSP [15].

SFBCSP
The original EEG signals are filtered into 17 sub-bands, and features are extracted for each sub-band using CSP. The filter bandwidth is 4 Hz and the overlap rate is 2 Hz in the range of 4-40 Hz. The sub-band features are selected by LASSO [23].

SBLFB
The original EEG signals are filtered into 17 sub-bands, and features are extracted for each sub-band using CSP. The filter bandwidth is 4 Hz and the overlap rate is 2 Hz in the range of 4-40 Hz. The sub-band features are selected by sparse Bayesian learning [24].

CSP-Wavelet+LASSO
After band-pass filtering, features are extracted using CSP-Wavelet. Features are selected by LASSO. Ensemble learning is used for secondary feature selection and classification model construction.

CSP-WPD+LASSO
After band-pass filtering, features are extracted using CSP-WPD. Features are selected by LASSO. Ensemble learning is used for secondary feature selection and classification model construction.

CSP-FB+LASSO
After band-pass filtering, features are extracted using CSP-FB. Features are selected by LASSO. Ensemble learning is used for secondary feature selection and classification model construction.

CSP-Wavelet+LOG
After band-pass filtering, features are extracted using CSP-Wavelet. Features are selected by LOG. Ensemble learning is used for secondary feature selection and classification model construction.

CSP-WPD+LOG
After band-pass filtering, features are extracted using CSP-WPD. Features are selected by LOG. Ensemble learning is used for secondary feature selection and classification model construction.

CSP-FB+LOG
After band-pass filtering, features are extracted using CSP-FB. Features are selected by LOG. Ensemble learning is used for secondary feature selection and classification model construction.
The regularization parameters of LASSO and LOG are selected using 10-fold cross-validation and a grid search. The candidate set of hyperparameters is λ ∈ {2^−5, 2^−4.8, ..., 2^4.8, 2^5}. LASSO was implemented using the SLEP toolbox [66].
Table 2 shows the classification accuracy of each method for the subjects in dataset 1. Except for CSP, the three proposed methods are significantly better than the compared methods. The CSP-FB+LOG method achieved the highest average classification accuracy among the three proposed methods, and the highest classification accuracy for multiple subjects. The four CSP improvement methods, including Wavelet-CSP, WPD-CSP, SFBCSP, and SBLFB, have lower average accuracy than traditional CSP.
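The cross-validated grid search over λ described above can be sketched as below. The `fit` and `score` callables are placeholders for the LASSO/LOG feature selection plus FLDA classification pipeline; the fold splitter and function names are ours.

```python
import numpy as np

# candidate regularization parameters: 2^-5, 2^-4.8, ..., 2^5 (51 values)
lambdas = 2.0 ** np.linspace(-5, 5, 51)

def kfold_indices(n, k=10, seed=0):
    """Shuffled k-fold split; yields (train_idx, val_idx) index arrays."""
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    for i in range(k):
        yield np.concatenate(folds[:i] + folds[i + 1:]), folds[i]

def grid_search_lambda(X, y, fit, score, lambdas, k=10):
    """Return the lambda with the best mean validation score.

    `fit(X, y, lam)` returns a trained model; `score(model, X, y)` evaluates it.
    """
    means = [np.mean([score(fit(X[tr], y[tr], lam), X[va], y[va])
                      for tr, va in kfold_indices(len(y), k)])
             for lam in lambdas]
    return lambdas[int(np.argmax(means))]
```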

Experimental Results and Analysis
There are 6 types of binary classification tasks in dataset 2, giving a total of 54 subject-tasks. Table 3 shows the classification accuracy of the subjects in dataset 2 for each method. The three proposed methods are significantly better than the compared methods. The CSP-Wavelet+LOG method achieves the highest average classification accuracy, followed by CSP-WPD+LOG and CSP-FB+LOG. The Wavelet-CSP and WPD-CSP methods are slightly better than CSP, but the SFBCSP and SBLFB methods are lower than CSP. Table 4 shows the classification accuracy of the subjects in dataset 3 for each method. The CSP-FB+LOG method is significantly better than the compared methods. CSP-WPD+LOG achieves the same average classification accuracy as CSP, and CSP-Wavelet+LOG is slightly lower than CSP; the other methods are lower than CSP. Table 5 shows the classification accuracy of the subjects in dataset 4 for each method. The three proposed methods are significantly better than the compared methods. The CSP-WPD+LOG method achieves the highest average classification accuracy among the three proposed methods, as well as the highest classification accuracy for multiple subjects. The Wavelet-CSP and WPD-CSP methods are better than CSP, but the SFBCSP and SBLFB methods are still lower than CSP.
In order to better demonstrate the superiority of the proposed methods, Figure 6 shows the classification accuracy of all methods in each subject. The red circle represents the classification accuracy of dataset 1 (seven subjects). The blue box represents the classification accuracy of dataset 2 (54 subjects). The cyan asterisk represents the classification accuracy of dataset 3 (nine subjects). The magenta triangle represents the classification accuracy of dataset 4 (14 subjects). Points above the diagonal indicate that the proposed methods are superior to the compared methods. From Figure 6, it can be seen that most of the points are above the diagonal, illustrating the superiority of the proposed methods.
In order to show the overall classification effect of the proposed methods more intuitively, Figure 7 shows the average classification accuracy for each dataset and the total average classification accuracy over all data, together with the corresponding standard deviations, for the CSP, Wavelet-CSP, WPD-CSP, SFBCSP, SBLFB, CSP-Wavelet+LOG, CSP-WPD+LOG, and CSP-FB+LOG methods. From Figure 7, it can be seen that the proposed methods are significantly better than the other methods. The CSP-WPD+LOG method achieves the highest average classification accuracy over all data; CSP-Wavelet+LOG and CSP-FB+LOG are slightly lower than CSP-WPD+LOG. The Wavelet-CSP and WPD-CSP methods are slightly better than CSP. The SFBCSP and SBLFB methods are lower than CSP in each dataset and over all data.
In order to study the effectiveness of secondary feature selection, Tables 6 and 7 show the classification results with and without secondary feature selection, using LASSO and LOG for feature selection, respectively. As Tables 6 and 7 show, regardless of the feature extraction method, secondary feature selection achieves better results. In particular, for the CSP-FB feature extraction method, the overall average classification accuracy is improved by 3.91% for LASSO and by 3.62% for LOG.
Comparing Tables 6 and 7, we can analyze the relative performance of LASSO and LOG. First, consider the situation without secondary feature selection: over all data, except for the CSP-WPD feature extraction method, the average classification accuracy of LOG is higher than that of LASSO. With secondary feature selection, LOG is also better than LASSO. Regardless of whether secondary feature selection is performed, LOG is superior to LASSO for the CSP-Wavelet and CSP-FB feature extraction methods. In summary, LOG is superior to LASSO.
In addition to LASSO, our method is also compared with three other feature selection methods, and the corresponding results are shown in Table 8. The Fisher score (F-score) [36] combined with the FLDA classifier constitutes a wrapper feature selection method; the optimal feature subset is selected using 10-fold cross-validation. The genetic algorithm (GA) and binary particle swarm optimization (BPSO) follow the literature [67], with parameter settings and classifier consistent with [67]. After feature selection, FLDA is used for classification. For dataset 1, when CSP-WPD is used for feature extraction, the average classification accuracy of LOG is slightly lower than that of LASSO; for dataset 2, when CSP-FB is used for feature extraction, the average classification accuracy of LOG is slightly lower than that of LASSO; in all other cases, LOG is better than LASSO. As Table 8 shows, LOG achieves the best classification performance, significantly better than the other feature selection methods. In addition, F-score is better than GA and BPSO.
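For reference, the Fisher score used by the wrapper baseline measures the ratio of between-class separation to within-class spread of a single feature. The following is a minimal two-class sketch on synthetic data; the exact formulation in [36] may differ in normalization details:

```python
import numpy as np

def fisher_score(x_pos, x_neg):
    """F-score of one feature for a two-class problem:
    (m1 - m2)^2 / (v1 + v2); larger means more discriminative."""
    m1, m2 = x_pos.mean(), x_neg.mean()
    v1, v2 = x_pos.var(ddof=1), x_neg.var(ddof=1)
    return (m1 - m2) ** 2 / (v1 + v2)

def rank_features(X_pos, X_neg):
    """Return feature indices sorted by descending F-score."""
    scores = np.array([fisher_score(X_pos[:, j], X_neg[:, j])
                       for j in range(X_pos.shape[1])])
    return np.argsort(scores)[::-1]

rng = np.random.default_rng(0)
# Synthetic trials: feature 0 is discriminative, feature 1 is pure noise.
X_pos = np.c_[rng.normal(2.0, 1.0, 100), rng.normal(0.0, 1.0, 100)]
X_neg = np.c_[rng.normal(-2.0, 1.0, 100), rng.normal(0.0, 1.0, 100)]
top = rank_features(X_pos, X_neg)[0]
```

A wrapper method then evaluates growing prefixes of this ranking with the classifier and keeps the subset with the best cross-validated accuracy.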
In order to study the effect of different classifiers on the performance of LOG, Table 9 shows the classification results of LOG using three classifiers. SVM is implemented using the LIBSVM toolbox [68], with a linear kernel and the toolbox's default parameter settings. Bayesian linear discriminant analysis (BLDA) [69] is an improvement on FLDA whose model parameters are automatically estimated from the training data. For CSP-Wavelet+LOG and CSP-WPD+LOG, the average accuracy (over all data) of FLDA is higher than that of SVM and BLDA. For CSP-FB+LOG, the average accuracy of FLDA is almost the same as that of SVM and BLDA. In general, FLDA is better than SVM and BLDA; therefore, FLDA is the preferred classifier for the methods proposed in this paper.
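As a point of reference for the classifier comparison, a two-class FLDA can be written in a few lines. This is a generic textbook sketch with an assumed small ridge term for numerical stability, not the exact implementation used in the experiments:

```python
import numpy as np

class FLDA:
    """Minimal two-class Fisher linear discriminant (a generic textbook
    sketch; the 1e-6 * I regularizer is an assumption for numerical
    stability, not a detail taken from the paper)."""

    def fit(self, X, y):
        X0, X1 = X[y == 0], X[y == 1]
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        # Within-class scatter matrix (sum of per-class scatter)
        Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
        self.w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), m1 - m0)
        self.b = -0.5 * self.w @ (m0 + m1)  # threshold midway between means
        return self

    def predict(self, X):
        return (X @ self.w + self.b > 0).astype(int)

# Synthetic two-class data: class means at -1 and +1 in every dimension.
rng = np.random.default_rng(3)
X = np.r_[rng.normal(-1, 1, (50, 4)), rng.normal(1, 1, (50, 4))]
y = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]
acc = (FLDA().fit(X, y).predict(X) == y).mean()
```

The low complexity of this closed-form classifier is part of why the proposed pipelines remain cheap at test time.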
In order to evaluate the effectiveness of our methods more comprehensively, Tables 10–12 show the classification results of our methods and other existing methods on the three BCI datasets. Table 10 shows the classification accuracy of the proposed methods and other recent methods for BCI Competition IV Dataset I. It can be seen from Table 10 that CSP-FB+LOG is second only to the method of [73]. The average classification accuracy of CSP-WPD+LOG is lower than that of [73,74], and the performance of CSP-Wavelet+LOG is mediocre. The literature [73] proposed a novel feature extraction method in which hybrid features of brain function are extracted based on a bilevel network: a minimum spanning tree (MST) over EEG signal nodes in different motor imagery tasks is constructed as the first network layer to solve the global network connectivity problem, and the regional network in different movement patterns is constructed as the second network layer to determine the network characteristics, consistent with the correspondence between limb movement patterns and the cerebral cortex in neurophysiology. Although [73] achieves better results, it relies on stronger priors, both in the frequency bands and in the EEG electrodes used to perform the classification; our method does not require any prior information. Table 11 shows the classification accuracy of the proposed methods and other recent methods for BCI Competition IV Dataset IIa. It can be seen from Table 11 that CSP-Wavelet+LOG and CSP-WPD+LOG are second only to [80], and CSP-FB+LOG is slightly lower than [76,77]. Although [80] achieves good classification results, it relies on data from other subjects, whereas our method only uses data from the subjects themselves. Our method is therefore more independent.
Table 12 shows the classification accuracy of the proposed methods and other recent methods for BCI Competition IV Dataset IIb. It can be seen from Table 12 that CSP-FB+LOG is second only to [30], CSP-WPD+LOG is lower than [30,84], and the performance of CSP-Wavelet+LOG is relatively poor. The literature [30] uses the EMD time-frequency analysis method for feature extraction and a CNN for classification. Compared with [30], considering the feature extraction time and the model complexity, our method has certain advantages.
In summary, although our methods have not achieved the best classification accuracy on every dataset, they are better than most existing methods. In addition, our methods have certain advantages in feature extraction time and classification model complexity: they use FLDA for classification, so the complexity of the classification model is low. The feature extraction time is examined in detail in the discussion section.

Discussion
When the CSP-Wavelet and CSP-WPD methods are used for feature extraction, the number of wavelet decomposition levels has a large impact on the classification accuracy. The choice of the number of levels involves two factors, namely the frequency resolution and the decomposition time. For a dataset with a sampling rate of 100 Hz, when the number of decomposition levels is less than or equal to 2, the frequency resolution is too low: the frequency bands related to motor imagery cannot be correctly distinguished, which is not conducive to extracting discriminative information. When the number of decomposition levels is greater than or equal to 5, the frequency resolution is too high, the extracted features are easily affected by noise, and the decomposition time also increases significantly. Therefore, for a dataset with a sampling rate of 100 Hz (dataset 1), only 3 and 4 decomposition levels are considered in this paper. Similarly, for a dataset with a sampling rate of 250 Hz or 256 Hz (datasets 2–4), only 4 and 5 decomposition levels are considered. Tables 13 and 14 show the classification results of the CSP-Wavelet and CSP-WPD methods using different numbers of decomposition levels, respectively. We first discuss the CSP-Wavelet method. In Table 13, when the sampling rate of the dataset is 100 Hz, L1 = 3 and L2 = 4; when the sampling rate is 250 Hz or 256 Hz, L1 = 4 and L2 = 5. It can be seen from Table 13 that the classification accuracy with the smaller number of decomposition levels is usually higher than with the larger number. Even in the cases where the larger number of decomposition levels yields slightly better accuracy, considering the decomposition time, we still choose the smaller number of levels. In Table 14, for the CSP-WPD method, we obtain similar results.
Therefore, when the sampling rate of the dataset is 100 Hz, 3 decomposition levels are used in this paper; when the sampling rate is 250 Hz or 256 Hz, 4 decomposition levels are used. In Tables 13 and 14, we study not only the influence of the number of decomposition levels on the classification results, but also the influence of sub-band selection. As Tables 13 and 14 show, in most cases sub-band selection helps to improve the classification accuracy. Manually excluding sub-bands that are obviously unrelated to the motor imagery tasks removes redundant information, reduces noise interference, and also reduces the feature dimension and model complexity; sub-band selection can therefore improve classification accuracy. It is worth pointing out that, when selecting sub-bands for CSP-Wavelet, the number of decomposition levels has no effect on the classification accuracy. The reason is that, when the number of decomposition levels is greater than or equal to 3 (100 Hz sampling rate) or greater than or equal to 4 (250 Hz or 256 Hz sampling rate), the selected sub-bands are the same.
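The frequency-resolution argument above can be made concrete by listing the approximate frequency range of each DWT sub-band for a given sampling rate and number of levels. This is a simplifying sketch that ignores the transition bands of the wavelet filters:

```python
def dwt_subbands(fs, levels):
    """Approximate frequency ranges (Hz) of DWT sub-bands.

    The level-k detail coefficients cover roughly (fs/2^(k+1), fs/2^k],
    and the final approximation covers [0, fs/2^(levels+1)).
    """
    bands = []
    hi = fs / 2.0
    for k in range(1, levels + 1):
        lo = fs / 2.0 ** (k + 1)
        bands.append(("D%d" % k, lo, hi))
        hi = lo
    bands.append(("A%d" % levels, 0.0, hi))
    return bands

# 100 Hz sampling, 3 levels: D3 spans 6.25-12.5 Hz and D2 spans
# 12.5-25 Hz, roughly covering the mu and beta rhythms.
bands_100 = dwt_subbands(100, 3)
```

With only 2 levels at 100 Hz, the lowest detail band would span 12.5–25 Hz and the mu band would be buried in the approximation coefficients together with delta and theta activity, matching the low-resolution argument above.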
LASSO has been widely used for EEG feature selection. However, LASSO, which regularizes the feature weights with the l1-norm, is a biased estimator relative to the l0-norm: the feature weights obtained by LASSO deviate from the true values and are too sparse. Non-convex regularization can alleviate this bias problem of the l1-norm [50]; the LOG method proposed in this paper can therefore improve the classification accuracy. To illustrate the problem more intuitively, for subject A01 with the motor imagery tasks of left hand vs. right hand, Figure 8 shows the feature weights obtained by LASSO and LOG, where the features are extracted by CSP-FB. The lower part of Figure 8 shows the feature weights obtained by performing secondary feature selection on the features selected by LASSO and LOG. A total of six channels of signals are retained after CSP filtering. Feature indices 1–10 in Figure 8 correspond to the features of the first channel signal after filtering by the 8–12 Hz, 10–14 Hz, ..., 26–30 Hz band-pass filters; feature indices 11–20 correspond to the features of the second channel signal after the same filters, and so on for the remaining indices. It can be seen from Figure 8 that the features selected by the LOG method come from the first, second, fifth, and sixth channel signals, while LASSO only selects features from the second and sixth channel signals. Therefore, the features selected by the LOG method contain more information and are more discriminative (according to the CSP principle, the first and last m channel signals are the most discriminative).
In summary, the features selected by LASSO are too few, and at the same time the selected features are not discriminative enough.
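One standard way to handle a log-penalized objective of this kind is iteratively reweighted l1 minimization: each outer iteration linearizes log(|w| + ε) at the current iterate and solves a weighted LASSO. The sketch below uses a proximal-gradient (ISTA) inner solver on synthetic data; λ, ε, and the iteration counts are assumed values for illustration, not the algorithm or settings of this paper:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def log_penalized_lasso(X, y, lam=0.1, eps=0.1, n_outer=5, n_inner=200):
    """Sketch of min_w 0.5*||y - Xw||^2 + lam * sum(log(|w_i| + eps))
    via iteratively reweighted l1: majorize log(|w|+eps) at the current
    iterate, then run an ISTA inner loop on the weighted LASSO."""
    n, p = X.shape
    w = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz constant
    for _ in range(n_outer):
        weights = lam / (np.abs(w) + eps)    # gradient of the log term
        for _ in range(n_inner):
            grad = X.T @ (X @ w - y)
            w = soft_threshold(w - step * grad, step * weights)
    return w

# Synthetic sparse regression: only coefficients 0 and 5 are nonzero.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 20))
w_true = np.zeros(20)
w_true[[0, 5]] = [3.0, -2.0]
y = X @ w_true + 0.01 * rng.normal(size=80)
w_hat = log_penalized_lasso(X, y)
```

Because the reweighting step assigns small penalties to large coefficients and large penalties to small ones, strong features are shrunk far less than under plain l1, which is exactly the bias reduction discussed above.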
In the classification results across all datasets, the SFBCSP and SBLFB methods perform relatively poorly. We used the ensemble learning method proposed in this paper to optimize these two methods, and the results are shown in Figure 9. Although the classification accuracy of the SFBCSP and SBLFB methods is effectively improved, it is still inferior to the other methods. There may be two reasons why SFBCSP and SBLFB do not achieve good classification results in this paper. On the one hand, the datasets used here differ from those in the original works, and the SFBCSP and SBLFB methods may not transfer well to new datasets. On the other hand, although we have tried to reproduce the SFBCSP and SBLFB methods as described by their authors, some data processing steps and details may not have been handled in exactly the same way. It is worth noting that the performance of the algorithms reproduced in this paper is similar to that reported in the literature [26]; in particular, the SFBCSP and SBLFB methods do not perform well on dataset 1. Table 15 shows the feature extraction time on the training set for each method. Three subjects were used for this experiment, namely 1a, A01, and S01. These three subjects come from three different datasets with different sampling rates and channel numbers. The feature extraction processes of SFBCSP and SBLFB are identical, so their feature extraction times are the same. Comparing the three proposed methods with the existing methods (CSP-Wavelet vs. Wavelet-CSP, CSP-WPD vs. WPD-CSP, and CSP-FB vs. SFBCSP), the feature extraction time is significantly reduced. Among the three proposed methods, CSP-FB requires the least time. Although CSP-FB takes longer than CSP, it can still be used in real-time BCI.
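As an illustration of why CSP-FB is fast, its feature computation reduces to a filter bank plus log-variance over the already CSP-filtered channels. The sketch below uses the ten overlapping 4 Hz bands covering 8–30 Hz described for Figure 8; the 4th-order Butterworth design and zero-phase filtering are assumed implementation details:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def csp_fb_features(Z, fs=250.0, bands=None, order=4):
    """Log-variance features of CSP-filtered signals Z (channels x samples)
    after a bank of overlapping band-pass filters.

    Sketch only: the filter order and zero-phase filtering are
    assumptions; Z is assumed to be already spatially filtered by CSP.
    """
    if bands is None:
        # 4 Hz-wide bands with 2 Hz overlap covering 8-30 Hz
        bands = [(lo, lo + 4) for lo in range(8, 27, 2)]
    feats = []
    for ch in Z:                          # one CSP channel at a time
        for lo, hi in bands:
            b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            x = filtfilt(b, a, ch)        # zero-phase band-pass filtering
            feats.append(np.log(np.var(x)))
    return np.array(feats)

rng = np.random.default_rng(2)
Z = rng.normal(size=(6, 1000))            # 6 CSP channels, 4 s at 250 Hz
f = csp_fb_features(Z)
```

For six CSP channels and ten bands this yields 60 features per trial, matching the feature indexing described for Figure 8; the whole computation is a handful of IIR filters and variances, which explains the short extraction time in Table 15.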
The three methods proposed in this paper do not consider the selection of the time window during feature extraction. Correct selection of the time window can effectively improve the classification accuracy, as has been verified in many existing works, such as [44,46]. Therefore, in future work, we will consider integrating time window selection into the proposed methods to further improve classification performance. In addition, the feature selection method proposed in this paper uses cross-validation to obtain the model parameters, which makes model training cumbersome and time-consuming; moreover, the parameters obtained by cross-validation are not necessarily optimal, especially in the case of small samples [85]. Implementing LASSO and LOG under a Bayesian framework [86] to avoid tedious cross-validation would further improve the performance of the proposed methods.

Conclusions
In this paper, we have proposed three new feature extraction methods and one feature selection method, with FLDA used for classification. Combining the feature extraction and feature selection methods, we obtain three new EEG decoding methods, namely CSP-Wavelet+LOG, CSP-WPD+LOG, and CSP-FB+LOG. The classification performance of the proposed methods is better than that of the existing methods. CSP-WPD+LOG achieves the highest total average classification accuracy among the three new methods, but its feature extraction time is the longest. The classification accuracy of CSP-Wavelet+LOG and CSP-FB+LOG is slightly lower than that of CSP-WPD+LOG, but these two methods have a large time advantage, especially CSP-FB+LOG, which can be used in a real-time brain-computer interface.
In future work, we will continue to optimize the proposed methods, for example, by improving the selection of the optimal filtering frequency band and time window, as well as the selection of model parameters. In addition, extension to multi-class classification will also be part of our future research.
Author Contributions: S.Z. and Z.Z. contributed to the study concept and design. B.Z. performed algorithm optimization. B.F., T.Y. and Z.L. conducted the collection, analysis, and interpretation of data. S.Z. and Z.Z. worked on the preparation of the manuscript. All authors have read and agreed to the published version of the manuscript.