Efficient Classification of Motor Imagery Electroencephalography Signals Using Deep Learning Methods

Single-trial motor imagery classification is a crucial aspect of brain–computer applications. Therefore, it is necessary to extract and discriminate signal features involving motor imagery movements. Riemannian geometry-based feature extraction methods are effective when designing these types of motor-imagery-based brain–computer interface applications. In the field of information theory, Riemannian geometry is mainly used with covariance matrices. Accordingly, investigations showed that if the method is used after the execution of the filterbank approach, the covariance matrix preserves the frequency and spatial information of the signal. Deep-learning methods are superior when the data availability is abundant and while there is a large number of features. The purpose of this study is to a) show how to use a single deep-learning-based classifier in conjunction with BCI (brain–computer interface) applications with the CSP (common spatial features) and the Riemannian geometry feature extraction methods in BCI applications and to b) describe one of the wrapper feature-selection algorithms, referred to as the particle swarm optimization, in combination with a decision tree algorithm. In this work, the CSP method was used for a multiclass case by using only one classifier. Additionally, a combination of power spectrum density features with covariance matrices mapped onto the tangent space of a Riemannian manifold was used. Furthermore, the particle swarm optimization method was implied to ease the training by penalizing bad features, and the moving windows method was used for augmentation. After empirical study, the convolutional neural network was adopted to classify the pre-processed data. Our proposed method improved the classification accuracy for several subjects that comprised the well-known BCI competition IV 2a dataset.


Introduction
Creating brain-computer interface (BCI) applications based on electroencephalograms (EEG) is a challenging scientific task given that they translate the mental imagery to sets of commands without using any muscles. BCI applications are a valuable part of neuroscience, neural engineering, and medicine, in which robotics or mental issue detectors are used. To date, they have been used extensively in many areas of medicine to help people by connecting their minds to control devices or by detecting brain abnormalities [1][2][3]. This can be recognized by measuring the electric or magnetic fields generated by the central nervous system using electroencephalography (EEG) or magnetoencephalography (MEG) [4]. Electroencephalographic signals are usually recorded with the placement and use of EEG sensors with another kind of electrode placed onto the surface of the scalp using a 10-20 electrode placement system (Jasper 1958).

Filterbank Common Spatial Pattern
The FBCSP [10] is shown in Figure 1. The first stage is the bandpass filtering stage that uses ICA decomposition to reduce noise in the second stage. Moreover, the third stage fitted the CSP and formulated the spatial filter, followed by the application of the spatial filters and transformation of the results into CSP spaces.

Filtering at the Base Frequency
Initially, the signal was divided into filterbanks with the help of IIR filters. The cut-off frequencies of the filters were chosen in the range of 8-36 Hz because better results were obtained in this range. the data augmentation method is presented in Section 3.1, and the filterbank particle swarm optimization (FBPSO) feature-selection algorithm is represented in Section 3.2. Accordingly, datasets and generated results are presented in Section 4, followed by a brief summary of the study.

Filterbank Common Spatial Pattern
The FBCSP [14] is shown in Figure 1. The first stage is the bandpass filtering stage that uses ICA decomposition to reduce noise in the second stage. Moreover, the third stage fitted the CSP and formulated the spatial filter, followed by the application of the spatial filters and transformation of the results into CSP spaces.

Filtering at the Base Frequency
Initially, the signal was divided into filterbanks with the help of IIR filters. The cut-off frequencies of the filters were chosen in the range of 8-36 Hz because better results were obtained in this range. Figure 1. Division of signals into subsignals by filtering followed by the application of spatial filters, as described by [14]. CSP: common spatial pattern.

Common Spatial Patterns for Two Classes
The CSP algorithm was then applied after the blind ICA source separation algorithm was deployed because it constitutes one of the best ways to reduce noise in brain signals. The core idea of CSP is the maximization of one class of features and the minimization of another so that the resulting signals encode the most significant information [15].
If two conditions, a and b, and the respective matrices X a and X b exist, they define an N × T shape, where N is the number of electrodes, and T is the number of samples per electrode. Firstly, we have to find the normalized covariance of the matrices of the trials: where R a i indicates the normalized covariance matrix of the i th trial of a group or a condition "a" (similarly for R b i ). Subsequently, the normalized covariance matrices are averaged according to The summation of these allows the formulation of the composed matrix Rc and the estimation of the eigenvalues and vectors, whereby R c =B c λB c T , and B c is the matrix of eigenvectors. λ is a diagonal matrix of eigenvalues. The whitening transform is computed by W = λ -1/2 B c T . Let Sa and Sb be S a =WR a W T and S b =WR b W T Division of signals into subsignals by filtering followed by the application of spatial filters, as described by [10]. CSP: common spatial pattern.

Common Spatial Patterns for Two Classes
The CSP algorithm was then applied after the blind ICA source separation algorithm was deployed because it constitutes one of the best ways to reduce noise in brain signals. The core idea of CSP is the maximization of one class of features and the minimization of another so that the resulting signals encode the most significant information [14].
If two conditions, a and b, and the respective matrices X a and X b exist, they define an N × T shape, where N is the number of electrodes, and T is the number of samples per electrode. Firstly, we have to find the normalized covariance of the matrices of the trials: where R i a indicates the normalized covariance matrix of the i th trial of a group or a condition "a" (similarly for R i b ). Subsequently, the normalized covariance matrices are averaged according to The summation of these allows the formulation of the composed matrix Rc and the estimation of the eigenvalues and vectors, whereby R c = B c λB T c , and B c is the matrix of eigenvectors. λ is a diagonal matrix of eigenvalues. The whitening transform is computed by W whitening = λ −1/2 B T c . Let S a and S b be S a = WR a W T and S b = WR b W T In the next step, we identify the aforementioned eigen decomposition of these matrices, which are expressed as S a = Uψ a U T and S b = Uψ b U T (4) The eigenvalues of the equations in (4) satisfy this equation ψ a + ψ b = I. Eventually, our spatial filter will be computed according to Thus, this filter can be applied as Our P matrix then satisfies these equations according to the representation listed in [15], which means that the spatial filter P diagonalizes both covariance matrices. After filtering the data, we can save only the important features and discard the less informative ones. To do this, m/2 rows above and m/2 rows below the matrix are selected so that information that represents both conditions is maintained. In the end, an m × T matrix is constructed, where m is the number of selected rows.
In this study, after the completion of the filtering process, we had seven spatially filtered filterbanks with sizes of m × T, where m = 4 and T = 500 in this case. These filterbanks are concatenated, as represented in Figure 2. The axis of zero is used for concatenation so that the resulting matrix has the form of (m × 7) × T; that is, 42 × 500. Accordingly, the computation of the covariance matrix based on this formulation yields a 36 × 36 matrix. In the next step, we identify the aforementioned eigen decomposition of these matrices, which are expressed as S a =Uψ a U T and S b =Uψ b U T (4) The eigenvalues of the equations in (4) satisfy this equation ψ a +ψ b = I. Eventually, our spatial filter will be computed according to Thus, this filter can be applied as Our P matrix then satisfies these equations according to the representation listed in [16], which means that the spatial filter P diagonalizes both covariance matrices.
After filtering the data, we can save only the important features and discard the less informative ones. To do this, m/2 rows above and m/2 rows below the matrix are selected so that information that represents both conditions is maintained. In the end, an m × T matrix is constructed, where m is the number of selected rows.
In this study, after the completion of the filtering process, we had seven spatially filtered filterbanks with sizes of m × T, where m = 4 and T = 500 in this case. These filterbanks are concatenated, as represented in Figure 2. The axis of zero is used for concatenation so that the resulting matrix has the form of (m × 7) × T; that is, 42 × 500. Accordingly, the computation of the covariance matrix based on this formulation yields a 36 × 36 matrix.

Multiclass Filterbank Common Spatial Pattern
The main objective associated with the use of spatial filters is to identify a signal's inner space with given conditions (classes). In this way, the CSP algorithm identifies one inner space whereby conditions are maximally represented. When multiple conditions exist (more than two), the problem is complex and difficult to handle. In [4], the information theoretic feature selection (ITFE) was proposed based on the joint approximate diagonalization (JAD) algorithm that solves minimal achievable classification and multiclass problems. It is based on the maximization of mutual information between data X and gives the labels P * =argmax I(c, P T R) so that L rows from the data matrix X can be selected as a signal's inner space, which preserves most of the information. The implementation sequence of the algorithm is shown below.

Multiclass Filterbank Common Spatial Pattern
The main objective associated with the use of spatial filters is to identify a signal's inner space with given conditions (classes). In this way, the CSP algorithm identifies one inner space whereby conditions are maximally represented. When multiple conditions exist (more than two), the problem is complex and difficult to handle. In [4], the information theoretic feature selection (ITFE) was proposed based on the joint approximate diagonalization (JAD) algorithm that solves minimal achievable classification and multiclass problems. It is based on the maximization of mutual information between data X and gives the labels P * = argmax{I(c, P T R)} so that L rows from the data matrix X can be selected as a signal's inner space, which preserves most of the information. The implementation sequence of the algorithm is shown below. First, the covariance matrices need to be computed as R x|c i , i = 1, . . . , M, where M is the number of classes.
Subsequently, the JAD algorithm has to be deployed. It is based on Equations (7) and (8), and as stated in [16], for every covariance matrix belonging to M classes, a W matrix has to be identified which diagonalizes all the covariance matrices as W T R x|c i W = D c i , i = 1, . . . , M.
Next, every column of w j , j = 1, . . . , N, of W where w j is the j th column of matrix W is taken and changed so that it satisfies w T j R x|c i w j = 1; then, the mutual information is computed according to the equation below (9) [4]: Eventually, L columns of W have to be selected and applied to the data X using the dot product.

Riemannian Geometry
The fundamental idea of this geometry is based on the mapping of the covariance matrix on the space which conveniently represents the data. The method learns the curve-shaped spaces that comprise Euclidean spaces. Therefore, as with the surface of the earth, the covariance matrix is located on a curve-based-space, and one should approach this accordingly. In the BCI field, it is assumed that the existence and use of the EEG signal are located in the specific curve-shaped space. Accordingly, if the signal is mapped to the defined space, it can be used more effectively. In this study, the concepts of Riemannian geometry were described, and the reader's attention is redirected to [11] and to the references listed therein for further instructions.

Spatial Covariance Matrices
Covariance matrices are computed using the equation given in (10). Suppose the original recording of the EEG signal is stored in matrix X, which has a shape of N × T, where N is the number of channels and T is the number of samples. In the calibration mode, each EEG signal is divided into supervised segments called trials with sizes of N × T s , where T s denotes the number of samples per trial. To apply it into Riemannian geometry-based algorithms, it should satisfy the equation T s N, and it should be symmetric positive-definite (SPD), which means it must be diagonalized with real positive eigenvalues [11].

Spatial Covariance Matrices
Let us consider that one has a set of matrices P which have m × m dimensions. As represented in [17], this set can be defined as P(m) = {P ∈ S(m) |u T Pu > 0, ∀u ∈ R m , u 0}, where S(m) indicates the space of all symmetric matrices. Moreover, this creates a manifold M with the dimension m(m + 1)/2.
The geodesic distance denotes the minimum length of the path between two points on the manifold M. The geodesic distance between two SPD covariance matrices can be estimated using the equation below: where λ i , i = 1, . . . , n indicates the real eigenvalues of P −1 1 P 2 .

Approximation of SPD Matrices
In Riemannian geometry, the mean of n ≥ 1 SPD matrices, which is also referred to as the geometry mean, can be formulated based on the geodesic distance:

Tangent Space Mapping
Barachant et al. [11] also proposed a method which maps the covariance matrices in a tangent space of the Riemannian manifold. Each SPD matrix P ∈ P(m), which is a point in Riemannian geometry, has a mapped version in the tangent space with the same dimension as that of the manifold m(m + 1)/2. The tangent space mapping is computed using the following equation: As shown in Figure 3, each of the SPD matrices map onto the tangent space at point Q mean , which is a point calculated with the use of Equation (12) and vectorized by obtaining only the triangular part of the matrix that corresponds to the dimensions of the tangent space. where λ i , i=1, …, n indicates the real eigenvalues of P 1 -1 P 2 .

Approximation of SPD Matrices
In Riemannian geometry, the mean of n ≥ 1 SPD matrices, which is also referred to as the geometry mean, can be formulated based on the geodesic distance:

Tangent Space Mapping
Barachant et al. [11] also proposed a method which maps the covariance matrices in a tangent space of the Riemannian manifold. Each SPD matrix P ∈P(m), which is a point in Riemannian geometry, has a mapped version in the tangent space with the same dimension as that of the manifold m (m+1) 2 ⁄ . The tangent space mapping is computed using the following equation: As shown in Figure 3, each of the SPD matrices map onto the tangent space at point Q mean , which is a point calculated with the use of equation (12) and vectorized by obtaining only the triangular part of the matrix that corresponds to the dimensions of the tangent space.

Particle Swarm Optimization
Particle swarm optimization (PSO) is an optimization method that works like a genetic algorithm (GA) [19]. However, it is easier and can be implemented with a few lines of code. PSO originated from the traversing of flocks of birds and advances a problem by trying to improve the candidate solution in a continuous manner, based on a given quality ratio [20].
Q mean P 1 , P 2 , …, P n = arg min S i =upper(log(Q mean

Particle Swarm Optimization
Particle swarm optimization (PSO) is an optimization method that works like a genetic algorithm (GA) [18]. However, it is easier and can be implemented with a few lines of code. PSO originated from the traversing of flocks of birds and advances a problem by trying to improve the candidate solution in a continuous manner, based on a given quality ratio [19]. Algorithm 1. Simple particle swarm optimization (PSO) pseudocode.
for each position p of particle compute fitness 5.
set best of pbest as gbest 8.
update particles velocity and position 9. gbest is our result The uniqueness of the algorithm compared to GA is based on the fact that it stores the variables for each particle's personal best position (pbest), global best position (gbest), velocity v, and current position x. As shown in the pseudocode of Algorithm 1, the individual best positions can be identified for each particle, and the best one is selected among all the pbest values of all the particles. In the long run, the velocity and the position of each particle can be updated using equation [19]: where w indicates the inertia value that represents the percentage of the old velocity value which will be maintained, and c 1 ∈ [0, 1] and c 2 ∈ [0, 1] are known as the "acceleration coefficients" used for the selection of the pbest and gbest values, respectively. The function rand() generates random (i.e., different) values between zero and one.

Proposed Method
In this study, a number of remarkable algorithms were used. After the data was read, data augmentation was applied. The augmentation process was completed by moving the sliced windows described in Section 3.1. The general architecture of the proposed method is depicted in Figure 4.

Data Augmentation
For data augmentation, the sliced windows method with a window size of w and a moving time of t moving was used. At each time instant, the window w was used such that there was a specific moving time t moving , as represented in Figure 5.

FBCSP Algorithm
In this study, first, the filtering operation was used, as shown in Section 2.1, after the independent component analysis was adopted to reduce the noise of the data.

Data Augmentation
For data augmentation, the sliced windows method with a window size of w and a moving time of t moving was used. At each time instant, the window w was used such that there was a specific moving time t moving , as represented in Figure 5.

FBCSP Algorithm
In this study, first, the filtering operation was used, as shown in Section 2.1, after the independent component analysis was adopted to reduce the noise of the data.

Data Augmentation
For data augmentation, the sliced windows method with a window size of w and a moving time of t moving was used. At each time instant, the window w was used such that there was a specific moving time t moving , as represented in Figure 5.

FBCSP Algorithm
In this study, first, the filtering operation was used, as shown in Section 2.1, after the independent component analysis was adopted to reduce the noise of the data. After the noise reduction, the CSP algorithm was deployed, as shown in sections 2.1 and 2.2, and m rows of the spatially filtered data were selected. As mentioned earlier, they were concatenated. Suppose frequency bands and n t samples were selected in each time trial. Overall, n× m×n f × n t shaped data were taken, where n is the number of trials. The next step was the computation of the After the noise reduction, the CSP algorithm was deployed, as shown in Sections 2.1 and 2.2, and m rows of the spatially filtered data were selected. As mentioned earlier, they were concatenated. Suppose n f frequency bands and n t samples were selected in each time trial. Overall, n×(m × n f )× n t shaped data were taken, where n is the number of trials. The next step was the computation of the covariance matrices of the trials using the equation given in (10). Eventually, the data size was be n×(m × n f ) × (m × n f ), whereby the data comprises n square matrices.

PSD Algorithm
At the same time, the PSDs of the data were computed for the n p f frequency bands using three frequency bands; i.e., alpha, beta, and gamma. The dataset D psd obtained from the use of this algorithm had a size of n × (N ch n p f .

Tangent Space Mapping
After the completion of all the calculations indicated above, two types of data composed of covariance matrices and power spectral densities were retrieved. The next step in this process was the deployment of the tangent space mapping explained in Section 2.4. The result was the vector D ts i , i ∈ 1, . . . , n, where n is the number of data samples (trials) given in our dataset with a dimension of ((m n f )(m n f +1))/2. Subsequently, for each trial, the tangent space D ts i and the flattened covariance matrix D c i were concatenated. Consequently, the D temp matrix with the shape of n×((m n f ) 2 + (m n f )(m n f +1)/2) was obtained. Correspondingly, when the subsequent part was replaced with n new , the shape of D temp was n × n new .

FSBPSO Algorithm
In our study, the wrapper-based feature-selection algorithm called feature selection with the binary PSO algorithm FSBPSO was deployed. A similar approach was adopted in [20]. It is obvious from its name that FSBPSO is a feature-selection algorithm based on PSO optimization. As represented in Figure 6, the entire dataset was input at the beginning, and the FSBPSO initialized positions randomly by where j = j ∈ 1, . . . , N new , represents the number of features. Subsequently, based on the quality value, which determines the cost accuracy taken from the KNN algorithm, the pbest and gbest values were found, and the position and velocity values were updated based on Equations (14) and (15). At the end, a vector with a shape of N new that will consist of ones and zeros according to Equation (16) was achieved. At this stage, the positions of the features which best defined the data were retrieved. Herein, this feature selection was applied on the data D temp , and a better accuracy than the concatenated case with D psd was obtained. During the feature selection, the data with a linear interpolation scheme was interpolated to fill the missing gaps. This means that the non-selected features were effectively replaced with average values. Subsequently, to obtain the final data D, the feature selections of D temp and D psd were concatenated. Therefore, the final data D had shape dimensions equal to n × n new +N ch n f p . The last part was set to n final = n new +N ch n f p so that in the end, D had shape dimensions equal to n ×n final .

Architecture
The use of deep networks is very rare in the BCI field owing to the increased noise ratio and the lack of adequate data. In this study, the number of training samples or signal trials was improved using data augmentation, as mentioned earlier. The application of a number of large and deep networks has also been attempted. However, as the networks became deeper, overfitting increased. Thus, using very deep networks was stopped. Instead, to classify the data D, a 1D convolutional neural network (CNN) was used. It is not uncommon for CNNs to be used in the classification of EEG signals. In [22] and [12], 1D convolutional networks were used in combination with long shortterm memory (LSTM). In the first case [22], the evoked results revealed that the combined architecture just outperformed the CNN. However, when they used only LSTM, the result was poor in comparison to the CNN. In [13], the authors used stacked CNNs as the encoder part of the AutoEncoder with the FFT features on two frequency bands: i.e., the mu (8-13Hz) and beta (13)(14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24)(25)(26)(27)(28)(29)(30) Hz) bands.
In this study, two types of architectures were used. The first one was composed of CNN and output softmax layers only. In the second one, the CNN layer was followed by fully connected layers with 100 output units. For training, the filter size was set to seven and the activation functions to rectified linear unit. To optimize our model, the Adam optimizer was used, and the learning rate applied varied based on the dataset and subjects. The number of epochs for training was set to 20. Correspondingly, increasing this number led to overfitting. At this stage, the positions of the features which best defined the data were retrieved. Herein, this feature selection was applied on the data D temp , and a better accuracy than the concatenated case with D psd was obtained. During the feature selection, the data with a linear interpolation scheme was interpolated to fill the missing gaps. This means that the non-selected features were effectively replaced with average values. Subsequently, to obtain the final data D, the feature selections of D temp and D psd were concatenated. Therefore, the final data D had shape dimensions equal to n× n new + N ch n p f . The last part was set to n final = n new + N ch n p f so that in the end, D had shape dimensions equal to n × n final .

Architecture
The use of deep networks is very rare in the BCI field owing to the increased noise ratio and the lack of adequate data. In this study, the number of training samples or signal trials was improved using data augmentation, as mentioned earlier. The application of a number of large and deep networks has also been attempted. However, as the networks became deeper, overfitting increased. Thus, using very deep networks was stopped. Instead, to classify the data D, a 1D convolutional neural network (CNN) was used. It is not uncommon for CNNs to be used in the classification of EEG signals. In [21] and [12], 1D convolutional networks were used in combination with long short-term memory (LSTM). In the first case [21], the evoked results revealed that the combined architecture just outperformed the CNN. However, when they used only LSTM, the result was poor in comparison to the CNN. In [13], the authors used stacked CNNs as the encoder part of the AutoEncoder with the FFT features on two frequency bands: i.e., the mu (8-13Hz) and beta (13)(14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24)(25)(26)(27)(28)(29)(30) bands.
In this study, two types of architectures were used. The first one was composed of CNN and output softmax layers only. In the second one, the CNN layer was followed by fully connected layers with 100 output units.
For training, the filter size was set to seven and the activation functions to rectified linear unit. To optimize our model, the Adam optimizer was used, and the learning rate applied varied based on the dataset and subjects. The number of epochs for training was set to 20. Correspondingly, increasing this number led to overfitting.

Datasets
To show how well our algorithm works, two publicly available datasets, 2a and 2b from the BCI Competition IV, were selected. Both of them were mental imagery datasets, and both included leftand right-hand movements. These datasets were recorded during the imaginary movement of hands or feet and were sampled at a 250 Hz frequency rate. For detailed information, please refer to [22].
The dataset 2a consists of 22-channel data from nine subjects and was presplit into training and testing sets. However, dataset 2b comprised nine subjects with three channels of EEG signals.

Results
In this study, experiments have been conducted with the use of the Python 3.6 environment with the MNE-Python EEG signal processing tool on an Intel (maximum 4.7 GHz) core i7 PC with 16 GB of RAM. The training and testing of each dataset was conducted separately for each subject. In the analysis process for dataset 2a, for example, the first subject's training and testing process was completed on the A1T and A1E files, respectively. Herein, accuracies are listed only for the test set.
The results of dataset 2a are listed in Table 1. These values have been obtained using semi-supervised online learning methods, based on which our model was trained using predicted test data labels. In the data augmentation process represented in Section 3.2, the number of iterations was four, the moving time was 0.3 s, and the window size w was 3 s. In the spatial filtering process, the number of selected rows m was six. The parameters of our FSBPSO algorithm during the stage of feature selection are listed in Table 2. For comparison, the results from several prior publications were distinguished based on state-of-the-art approaches. It is clear from the table that our method has comparable results with other methods (sample size = four subjects). If the average values of the results were taken into consideration, one can see that our proposed method surpassed the majority of the other methods by reaching a classification accuracy of 80.44%, as shown in Table 1. The authors of SR-MDRM used a very similar approach to our method; however, in their work, a spatial filter regularized by coordinates of electrodes to ensure prior information from EEG channels and a minimum classification distance to the Riemannian mean was used. Furthermore, In WOLA-CSP [26], the WOLA algorithm, which consists of applying FFT to compute power spectrum density, shifting the spectrum of given frequencies measured by ERDSA, which is a specific-subject frequency selection algorithm, to the origin of the axis, and subsequently using IFFT to transform the spectrum back to the time domain, was used. Additionally, CSP to extract features and latent Dirichlet allocation (LDA) to classify the retrieved features were used; however, the data were still sparse.
In addition to this, the results obtained from the next dataset 2b are listed in Table 3. The dataset contained highly corrupted EEG signals, but the ICA algorithm managed to filter the signal. Data augmentation was conducted based on the same procedure as that used for the dataset 2a.
The features of the dataset have been obtained by the proposed algorithm, and the parameters of the FSBPSO algorithm are also the same as those for dataset 2a, as listed in Table 2. To compare the results of dataset 2b, the feature set was evaluated with other well-known machine learning techniques, including the latent Dirichlet allocation (LDA) and support vector machines, and constructed similar encoder parts of the AutoEncoder with CNNs based on the formulation of the CNN-SAE model of Yousef and Ugur [13]. As shown, our method yields better results for all the tested subjects except for Subject 4, and the average value of the classification accuracy (82.39 %) also outperforms the corresponding accuracies of the other studies, as indicated in Table 3.
The comparisons listed above were for two-class cases only; that is, for the right and left hands. For multiclass cases, 10-fold cross-validation classification results for the 2a dataset from the BCI competition IV [27] are represented for all the classes, including the left hand, right hand, tongue, and both feet, as shown in Table 4.
For comparison, three related results for the dataset are represented, including the TSLDA [11], which represents the tangent space mapping feature extraction method alongside the LDA classification method, and the CSP* and the LDA method, in which the best features were selected for each subject according to the FDR criterion [11] in association with the LDA classification method. The results for the MDRM minimum distance to Riemannian manifold method [11] were also presented. Our method outperformed all the referenced examples, including the latter one.

Discussion and Conclusions
In this study, a number of well-known feature extraction methods were combined for EEG signal processing, and one of the deep-learning-based approaches was described. In addition to this, several deep learning techniques were also evaluated, such as online learning (unsupervised learning) and transfer learning. It was found that by using online learning, one could increase the classification accuracies of the EEG signals by almost 2%, while transfer learning led to bad results. It was observed that very deep networks can cause overfitting. However, if any one layer in the CNN network does not perform well, the parameters of the model are not strong enough to allow the learning of the problem. Therefore, in instances where this occurred, another fully connected layer was added immediately after the layer with poor performance. Deep-learning-based techniques often require an extensive number of resources. However, in this study, this problem was solved with the use of data augmentation.