A Novel Convolutional Neural Network Classification Approach of Motor-Imagery EEG Recording Based on Deep Learning

Abstract: Recently, electroencephalography (EEG) motor imagery (MI) signals have received increasing attention because they can be used to encode a person's intention to perform an action. Researchers have used MI signals to help people with partial or total paralysis control devices such as exoskeletons, wheelchairs, and prostheses, and even to drive independently. Therefore, classifying the motor imagery tasks in these signals is important for a Brain-Computer Interface (BCI) system. Building a good decoder for MI tasks from EEG signals is difficult because of the dynamic nature of the signal, its low signal-to-noise ratio, its complexity, and its dependence on the sensor positions. In this paper, we investigate five multilayer methods for classifying MI tasks, based on an Artificial Neural Network (ANN), Convolutional Neural Network 1 (CNN1), CNN2, CNN1 merged with CNN2, and a modified version of CNN1 merged with CNN2. These methods use different spatial and temporal characteristics extracted from raw EEG data. We demonstrate that our proposed CNN1-based method, which uses spatial and frequency characteristics, outperforms state-of-the-art machine/deep learning techniques for EEG classification with an accuracy of 68.77% on the BCI Competition IV-2a dataset, which includes nine subjects performing four MI tasks (left hand, right hand, feet, and tongue). The experimental results demonstrate the feasibility of the proposed method for classifying MI-EEG signals and suggest that it can be applied successfully to BCI systems where the amount of data is large due to daily recording.


Introduction
In recent years, the use of brain signals recorded by electroencephalography (EEG) has been widely explored for various applications, with a major focus on the field of biomedical engineering. A Brain-Computer Interface (BCI) system, also referred to as brain-machine interaction, bridges the gap between humans and computers by translating thoughts into commands, which can be used to communicate with external devices like exoskeletons,

• The performance of the classifiers is improved by pre-processing the EEG signals: we remove the EOG channels and apply a bandpass filter.
• We extract frequency and spatial features using the WPD and CSP techniques, respectively.
• We show that the proposed method based on the CNN1 model gives the highest accuracy compared to the state of the art.
The rest of this paper is organized as follows. In Section 2, we briefly review related work. Section 3 presents our five proposed methods for the classification of MI tasks in EEG signals. In Section 4, we analyze the experimental results that verify the effectiveness of the proposed methods. Section 5 draws a conclusion and provides direction for future research.

Related Work
Recently, deep learning (DL) and traditional algorithms have been combined with other methods to extract meaningful information from EEG signals. For example, in [7], the authors proposed a method that combines a CNN with the gradient boosting (GB) algorithm.
These algorithms are widely used for EEG signals, but their performance and accuracy in processing EEG signals are not satisfactory. Many researchers have started investigating the potential of various DL models for EEG signal analysis [8]. DL models, in particular CNNs, can extract more robust and discriminating features [9,10].
Other models, such as Long Short-Term Memory (LSTM) [11], the Recurrent Neural Network (RNN) [12], the Deep Belief Network (DBN), and the Sparse AutoEncoder (SAE) [13], are useful in time series applications. Research shows that DL has performed well in the field of EEG signal processing [14], which indicates that automatically extracted features are better than manually extracted ones.
In [15], the authors propose a method that combines the multi-branch 3D CNN and the 3D representation of the EEG. The strength of the application of 3D CNN is that the temporal and spatial characteristics of EEG signals can be extracted simultaneously, and the relationship between them can be fully utilized. In [16], the authors explore CNN in combination with Filter Bank Common Spatial Pattern (FBCSP), a spatial filtering algorithm, and their own approach to extract temporal characteristics. In [17], four different CNN models with increasing depths are used to learn the spatial and temporal characteristics, which are then merged and sent to either an autoencoder or to a multilayer perceptron for classification.
The training phase of DL models is difficult on small datasets because these models can have millions of parameters, which usually requires a large amount of training data. There are not many public EEG datasets available, and those that are available are limited in size. This limits the scope of applying deep networks in this area. However, techniques such as transfer learning open new avenues for using deep networks that are first pre-trained on large datasets and then fine-tuned on smaller datasets. These techniques show better performance and reduce the training time of the deep models [29].
Numerous variants of CNN models have been used for image classification with good accuracy. One of them is the merging of several CNNs for feature aggregation.
Many researchers merge multiple CNN models and their intermediate features [21-29] and have had some success. Although researchers have applied CNNs and other DL models to MI-EEG data and obtained good accuracy values, they could not achieve major improvements over traditional machine learning techniques [13,30]. This is because the EEG signal is characterized by a low SNR and a low spatial resolution, and is difficult to interpret due to its non-stationarity.
In this work, we propose five methods for the classification of MI-EEG signals. These methods are based on an Artificial Neural Network (ANN), Convolutional Neural Network 1 (CNN1), CNN2, CNN1 merged with CNN2, and a modified version of CNN1 merged with CNN2. All methods start with a pre-processing step in which the EOG channels are removed and a bandpass filter is applied to the EEG data. We then extract frequency characteristics using Wavelet Packet Decomposition (WPD) and spatial characteristics using Common Spatial Pattern (CSP). The resulting characteristics are passed to the proposed classifiers: ANN, CNN1, CNN2, merged CNNs, and modified merged CNNs. Each CNN has a different depth. Our CNN-based method reports improved performance on MI-EEG data using its frequency and spatial characteristics. We also show that merging the two CNN models and adding LSTM layers to the CNN model does not always give better results.

Data Set Description
The well-known BCI Competition IV 2a database [31] is used to train and then test our method. It enables the comparison of our results with those of state-of-the-art methods.
The subjects perform four MI tasks: left hand, right hand, feet, and tongue. The data are split into short runs, and each run contains 48 trials (12 for each MI task).
Data collection consists of two sessions recorded on two different days, and each session comprises six runs separated by short breaks. Thus, a total of 288 trials is collected per session.
Each session starts with a recording of approximately 5 min of EEG data to estimate the influence of the 3 EOG channels. This recording is divided into 3 blocks: (1) 2 min with eyes open (looking at a fixation cross on the screen), (2) 1 min with eyes closed, and (3) 1 min with eye movements (Figure 1). Due to technical issues, the EOG block is shorter for Subject 4 and only contains the eye movement condition. The timing of data acquisition is shown in Figure 2. At the start of a trial (t = 0 s), a fixation cross appears on the black screen and a short acoustic warning tone is emitted. After 2 s (t = 2 s), a cue in the form of an arrow pointing left, right, down, or up (corresponding to one of the 4 classes) appears and remains on screen for 1.25 s. This prompts the subject to perform the desired MI task until the fixation cross disappears from the screen at t = 6 s. Finally, a short break with a black screen follows. The data were sampled at 250 Hz and filtered by a bandpass filter between 0.5 Hz and 100 Hz. An additional 50 Hz notch filter was used to remove line noise.
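The timing figures above translate directly into sample indices. The following is a minimal sketch of that arithmetic, using only the values quoted in the dataset description; the constant and function names are illustrative, and the choice of the 2 s to 6 s window as the analysis epoch is an assumption, since the text does not state which window was used.

```python
# Sketch only: values come from the dataset description above;
# names are illustrative and the epoch window is an assumption.
FS_HZ = 250              # sampling frequency
CUE_ONSET_S = 2.0        # cue appears 2 s after trial start
MI_END_S = 6.0           # fixation cross disappears at t = 6 s
RUNS_PER_SESSION = 6
TRIALS_PER_RUN = 48


def to_sample(t_seconds, fs_hz=FS_HZ):
    """Convert a time offset within a trial to a sample index."""
    return int(round(t_seconds * fs_hz))


mi_start = to_sample(CUE_ONSET_S)          # first sample after cue onset
mi_stop = to_sample(MI_END_S)              # sample where imagery ends
mi_window_len = mi_stop - mi_start         # number of samples in the MI window
trials_per_session = RUNS_PER_SESSION * TRIALS_PER_RUN

print(mi_start, mi_stop, mi_window_len, trials_per_session)
```

With a 250 Hz sampling rate, the 4 s imagery period therefore spans 1000 samples per channel.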

Proposed Work
In this section, we detail our proposed method for the classification of MI-EEG signals. The method (Figure 3) begins by applying a 7 to 30 Hz bandpass filter to the BCI Competition IV 2a dataset. Then, we eliminate the three EOG channels and keep only the 22 EEG channels. Subsequently, we apply the WPD method to each EEG channel to extract frequency characteristics, followed by the CSP algorithm to extract spatial characteristics. The features extracted by CSP are used as input to our five proposed models: ANN, CNN1, CNN2, merged CNNs, and modified merged CNNs.
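The pre-processing step can be sketched as follows. The text does not say which filter design was used, so this example substitutes a simple FFT-based (brick-wall) bandpass in NumPy as a stand-in; the function names, the assumption that the 22 EEG channels come first in a 25-channel array, and the synthetic test signal are all illustrative.

```python
import numpy as np


def fft_bandpass(x, fs_hz, low_hz, high_hz):
    """Zero every FFT bin outside [low_hz, high_hz] along the last axis."""
    n = x.shape[-1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs_hz)
    spectrum = np.fft.rfft(x, axis=-1)
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[..., ~keep] = 0.0
    return np.fft.irfft(spectrum, n=n, axis=-1)


def preprocess(recording, fs_hz=250, n_eeg=22):
    """recording: (25 channels, n_samples); EEG channels assumed to come first."""
    filtered = fft_bandpass(recording, fs_hz, 7.0, 30.0)
    return filtered[:n_eeg]          # drop the 3 trailing EOG channels


# Synthetic demo: a 10 Hz component (in band) plus a 50 Hz component (out of band).
fs = 250
t = np.arange(4 * fs) / fs
in_band = np.sin(2 * np.pi * 10 * t)
out_of_band = np.sin(2 * np.pi * 50 * t)
demo = np.tile(in_band + out_of_band, (25, 1))
clean = preprocess(demo, fs)
print(clean.shape)
```

A causal or zero-phase IIR/FIR filter would be the more common choice in practice; the FFT mask is used here only because it keeps the sketch short and easy to verify.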

Wavelet Packet Decomposition
Coifman et al. introduced the wavelet packet and proposed the concept of the orthogonal wavelet packet, based on orthogonal wavelets. The wavelet packet transform decomposes both the low-frequency part and the high-frequency part of the signal in a more detailed way. Moreover, this decomposition has neither redundancy nor omission, which gives it a better time-frequency localization capability than the wavelet transform for signals containing medium- and high-frequency information, and thus helps improve the accuracy of EEG recognition. It can be understood as a decomposition of the signal space.
The EEG signal was restricted to the 8 to 31 Hz frequency band. Figure 4 shows the frequency decomposition obtained from the filter banks at each step.
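To illustrate the idea of splitting both the low- and high-frequency halves at every level, here is a minimal wavelet packet sketch. It uses the Haar wavelet purely as an assumption, since the text does not name the mother wavelet or the decomposition depth; function names are illustrative. The nodes are returned in natural (filter-bank) order, which is not the same as increasing-frequency order.

```python
import math

FS_HZ = 250  # sampling frequency of the dataset


def haar_step(x):
    """Split x into orthonormal Haar approximation and detail halves."""
    if len(x) % 2:
        raise ValueError("length must be even at every level")
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return approx, detail


def wavelet_packet(x, level):
    """Full wavelet packet tree: every node is split again at each level."""
    nodes = [list(x)]
    for _ in range(level):
        next_nodes = []
        for node in nodes:
            approx, detail = haar_step(node)
            next_nodes.extend([approx, detail])
        nodes = next_nodes
    return nodes


signal = [math.sin(2 * math.pi * 10 * n / FS_HZ) for n in range(1000)]
nodes = wavelet_packet(signal, level=3)
node_bandwidth_hz = (FS_HZ / 2) / len(nodes)   # each node covers an equal share of 0..125 Hz

energy_in = sum(v * v for v in signal)
energy_out = sum(v * v for node in nodes for v in node)
print(len(nodes), node_bandwidth_hz, round(energy_in, 6), round(energy_out, 6))
```

Because each Haar step is orthonormal, the total energy is preserved across the leaves, which is the "neither redundancy nor omission" property mentioned above.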

Appl. Sci. 2021, 11, x FOR PEER REVIEW 6 of 18

Common Spatial Pattern
The CSP algorithm is a feature extraction method. It relies on the simultaneous diagonalization of the covariance matrices of a two-class signal. The main idea of CSP is to find the optimal projection matrix, which maximizes the variance for one class while minimizing the variance for the other class. Consequently, the algorithm maximizes the difference between the two classes, for example, the right hand and the left hand.
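A minimal two-class CSP sketch in NumPy is shown below. It follows the standard whitening-then-diagonalization recipe; the trace normalization, the log-variance features, the number of filter pairs, and the synthetic data are assumptions for illustration. How the two-class algorithm was extended to the four MI classes is not described in the text and is not covered here.

```python
import numpy as np


def normalized_cov(trial):
    """Trace-normalized spatial covariance of one trial (channels x samples)."""
    c = trial @ trial.T
    return c / np.trace(c)


def csp_filters(trials_a, trials_b):
    """Spatial filters (rows of w), sorted by class-A variance share, descending."""
    cov_a = np.mean([normalized_cov(t) for t in trials_a], axis=0)
    cov_b = np.mean([normalized_cov(t) for t in trials_b], axis=0)
    # Whiten the composite covariance cov_a + cov_b.
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whitening = np.diag(evals ** -0.5) @ evecs.T
    # Diagonalize the whitened class-A covariance; class B shares the eigenvectors.
    share_a, rotation = np.linalg.eigh(whitening @ cov_a @ whitening.T)
    order = np.argsort(share_a)[::-1]
    w = rotation[:, order].T @ whitening
    return w, share_a[order], cov_a, cov_b


def log_variance_features(w, trial, n_pairs=2):
    """Log of normalized variance of the first and last n_pairs filter outputs."""
    picked = np.vstack([w[:n_pairs], w[-n_pairs:]])
    var = np.var(picked @ trial, axis=1)
    return np.log(var / var.sum())


# Synthetic demo: class A is strong on channel 0, class B on channel 3.
rng = np.random.default_rng(0)
scale_a = np.array([3.0, 1.0, 1.0, 1.0])[:, None]
scale_b = np.array([1.0, 1.0, 1.0, 3.0])[:, None]
trials_a = [rng.standard_normal((4, 500)) * scale_a for _ in range(20)]
trials_b = [rng.standard_normal((4, 500)) * scale_b for _ in range(20)]

w, share_a, cov_a, cov_b = csp_filters(trials_a, trials_b)
feat_a = log_variance_features(w, trials_a[0])
print(share_a)
```

The first rows of `w` capture most variance for class A and the last rows capture most variance for class B, which is why features are usually taken from both ends.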

Proposed ANN Model
Our ANN model is composed of a first layer, a last layer, and several intermediate layers between them. The first layer contains three elements: a Dense layer, an activation function, and a Dropout layer. Instead of Sigmoid, the activation may be a Rectified Linear Unit (ReLU), Tanh, an Exponential Linear Unit (ELU), or a Scaled Exponential Linear Unit (SELU). The intermediate layers are composed of seven blocks. Each block starts with a Batch Normalization layer followed by a Dense layer, an activation function, and a Dropout layer. The last layer is a Dense layer with the SoftMax activation function.
Batch normalization is a technique for improving the speed, performance, and stability of artificial neural networks. The most common activation function for ANNs is the ReLU, defined as f(x) = max(0, x): it outputs x when x is positive and 0 otherwise. It is less computationally expensive than some other common activation functions, such as Tanh and Sigmoid, because the mathematical operation is simpler and the activation is sparser. Since the function outputs 0 when x ≤ 0, there is a considerable chance that a given unit does not activate at all. We tested four activation functions (Figure 5): ReLU, ELU, SELU, and Tanh.
The proposed ANN model consists of 117,060 parameters. We used the ADAM optimizer for weight updates and the categorical cross-entropy loss. We trained the network with a batch size of 16 for 1000 training epochs. Figure 6 illustrates the proposed ANN architecture used to classify MI tasks.
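For reference, the four activation functions that were tested can be written in a few lines. This is a plain-Python sketch using the standard textbook definitions; the ELU `alpha` default of 1.0 and the SELU constants are the commonly published values, not settings confirmed by the text.

```python
import math

# Standard SELU constants from the self-normalizing networks formulation.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def elu(x, alpha=1.0):
    """x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x):
    """Scaled ELU with fixed alpha and scale."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def tanh(x):
    """Hyperbolic tangent, bounded in (-1, 1)."""
    return math.tanh(x)


for name, fn in [("relu", relu), ("elu", elu), ("selu", selu), ("tanh", tanh)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
```

Unlike ReLU, the ELU, SELU, and Tanh functions return non-zero values for negative inputs, so units are not silenced entirely when x ≤ 0.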

Proposed CNN Models
We propose two CNN models. The first one (Figure 7a) starts with 6 blocks, each consisting of a convolutional layer, an activation function, and a Max Pooling step. These blocks are followed by a Flatten layer, which converts the output of the convolutional blocks into a single long one-dimensional feature vector. This vector is the input to the final classification stage, the fully connected part of the network, which consists of 6 Dense layers.
The convolutional layer is the key component of a CNN and always constitutes at least its first layer. Its goal is to identify the presence of a set of features in the input it receives. The Max Pooling operation reduces the size of the data while preserving its important characteristics. The Flatten layer flattens the tensor and reduces its dimension. We used four activation functions: ReLU, SELU, ELU, and Tanh. Our second CNN model (Figure 7b) has 4 blocks of the same type as in the first model, followed by an LSTM layer. The LSTM is a type of RNN.
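The convolution, activation, pooling, and flatten sequence of one block can be sketched on a toy signal. This is a one-dimensional, pure-Python illustration with hypothetical kernels; the actual kernel sizes, number of filters, and input dimensionality of CNN1 and CNN2 are not given in the text.

```python
def conv1d_valid(signal, kernel):
    """Cross-correlation (the 'convolution' of DL frameworks), no padding, stride 1."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing remainder is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


def flatten(feature_maps):
    """Concatenate all feature maps into one long feature vector."""
    return [v for fmap in feature_maps for v in fmap]


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]
kernels = [[1.0, -1.0], [0.5, 0.5]]          # hypothetical filters
maps = [max_pool1d(relu(conv1d_valid(signal, k))) for k in kernels]
features = flatten(maps)
print(features)   # [2.0, 1.0, 1.5, 1.0]
```

Each kernel yields one pooled feature map, and the flattened vector is what a Dense layer would receive.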
The idea behind this choice of neural network architecture is to divide the signal between what is important in the short term, through the hidden state (analogous to the output of a simple RNN cell), and what is important in the long term, through the cell state, which will be explained below. Thus, the global functioning of an LSTM can be summarized in 3 steps:

1. Detect the relevant information from the past, taken from the cell state, through the forget gate.
2. Select, from the current input, the information that will be relevant in the long term, via the input gate. This information is added to the cell state, which acts as a long-term memory.
3. Draw from the new cell state the important short-term information to generate the next hidden state, through the output gate.
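The three steps above correspond to the standard LSTM cell equations. The sketch below implements one time step for a scalar cell (one input, one hidden unit) to keep it dependency-free; the parameter layout and names are illustrative and do not reflect the layer sizes used in CNN2, which the text does not give.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps a gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))        # step 1: what to keep from the past
    i = sigmoid(pre_activation("input"))         # step 2: how much new information to store
    g = math.tanh(pre_activation("candidate"))   #         candidate values for the cell state
    c = f * c_prev + i * g                       # updated long-term memory
    o = sigmoid(pre_activation("output"))        # step 3: what to expose as hidden state
    h = o * math.tanh(c)
    return h, c


# With all-zero parameters every gate equals 0.5 and the candidate equals 0.
zero_params = {name: (0.0, 0.0, 0.0)
               for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
print(h, c)
```

In the zero-parameter case the cell state is simply halved, which makes the role of the forget gate easy to see.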
The LSTM layer is followed by a Flatten layer and 3 Dense layers. CNNs are commonly used to solve problems with spatial data, whereas LSTMs are best suited for analyzing temporal and sequential data. It is therefore useful to think of our CNN2 architecture as defining two sub-models: the CNN model for feature extraction and the LSTM model for interpreting the features across time steps. The proposed CNN1 model has 23,767,300 parameters, and the proposed CNN2 model has 696,324 parameters. For both CNNs, we used the ADAM optimizer for weight updates and the categorical cross-entropy loss. We trained each network with a batch size of 16 for 1000 training epochs.
The ADAM optimizer is a stochastic gradient descent method used to minimize an objective function written as a sum of differentiable functions. Categorical cross-entropy is a loss function used in multi-class classification tasks.
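As a small worked example of the loss, the following sketch computes SoftMax probabilities and the categorical cross-entropy for a four-class problem such as the four MI tasks. The function names and the `eps` clipping value are illustrative choices.

```python
import math


def softmax(logits):
    """Numerically stable SoftMax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """-sum(t_k * log(p_k)) over classes; eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# An untrained classifier that is equally unsure about all four classes.
probs = softmax([0.0, 0.0, 0.0, 0.0])
loss = categorical_cross_entropy([0, 1, 0, 0], probs)
print(probs, loss)
```

For four classes, a uniform prediction gives a loss of ln 4 (about 1.386), a useful baseline when reading training curves.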

Proposed Merged CNNs Model
We merged the two models, CNN1 and CNN2, as shown in Figure 8. The fusion is built with a monolayer perceptron: the multi-layer CNN features coming out of the concatenation layer are fed to the monolayer perceptron. The monolayer perceptron consists of one hidden layer with 8 nodes. It is trained on the combined feature vector, and its output is sent to the SoftMax layer to obtain the probability score for each MI class.
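The fusion head can be sketched as a forward pass: concatenate the two feature vectors, apply one hidden layer with 8 nodes, then SoftMax over the four classes. The feature sizes, the Tanh hidden activation, and the random weights are assumptions made for illustration only.

```python
import math
import random


def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output node."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_params(n_in, n_hidden=8, n_classes=4, seed=0):
    """Hypothetical untrained weights for the fusion head."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"w_hidden": matrix(n_hidden, n_in), "b_hidden": [0.0] * n_hidden,
            "w_out": matrix(n_classes, n_hidden), "b_out": [0.0] * n_classes}


def merged_forward(feat_cnn1, feat_cnn2, params):
    merged = list(feat_cnn1) + list(feat_cnn2)        # concatenation layer
    hidden = [math.tanh(z)                            # hidden layer with 8 nodes
              for z in dense(merged, params["w_hidden"], params["b_hidden"])]
    return softmax(dense(hidden, params["w_out"], params["b_out"]))


feat1 = [0.2, -0.5, 1.0]      # stand-in for CNN1 features
feat2 = [0.7, 0.1]            # stand-in for CNN2 features
params = random_params(n_in=len(feat1) + len(feat2))
probs = merged_forward(feat1, feat2, params)
print(probs)
```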

Proposed Modified Merged CNNs Model
We modified our proposed fusion of CNN1 and CNN2 by reducing the number of blocks and Dense layers and by eliminating the Max Pooling step. In addition, we combine the modified CNN1 and CNN2 (Figure 9) by removing the final SoftMax classification layer from each of them and concatenating the features using a monolayer perceptron. The proposed merged CNNs model has 24,463,660 parameters, and the proposed modified merged CNNs model has 24,691,972 parameters. For both merged models, we used the ADAM optimizer for weight updates and the categorical cross-entropy loss. We trained each network with a batch size of 16 for 1000 training epochs.
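The ADAM weight update used for training can be sketched for a single parameter as follows. The learning rate, decay rates, and `eps` shown as defaults are the values commonly quoted for ADAM; the hyperparameters actually used here are not stated in the text, and the quadratic objective is a toy example.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def grad_fn(theta):
    """Gradient of the toy objective f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta, m, v = 0.0, 0.0, 0.0
theta_after_first_step = None
for step in range(1, 201):
    theta, m, v = adam_step(theta, grad_fn(theta), m, v, step, lr=0.1)
    if step == 1:
        theta_after_first_step = theta
print(theta_after_first_step, theta)
```

Because of the bias correction, the very first update has a magnitude close to the learning rate regardless of the gradient scale.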

Results and Discussion
In this section, we analyze our proposed methods to determine the optimal classifier. Following the dataset instructions, a classifier was trained and tested for each subject.
Training even a simple DL model can take many hours or even days on a laptop (typically on a CPU), which gives the impression that DL requires large systems to run. GPUs and TPUs, on the other hand, can train these models much faster, but not everyone can afford a GPU because they are expensive. This is where Google Colab comes into play. Google Colab is a free cloud service hosted by Google to encourage Machine Learning and Artificial Intelligence research. It is a powerful platform for learning and quickly developing DL models in Python (version 3.7).
As mentioned in Section 3.1, each subject has 288 trials; we used 80% of the trials for training and 20% for testing.
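The 80/20 split of the 288 trials can be sketched as below. Whether the split was shuffled, stratified by class, or rounded in this exact way is not stated in the text, so the shuffling, the seed, and the rounding rule are assumptions.

```python
import random


def split_trials(n_trials=288, train_fraction=0.8, seed=0):
    """Shuffle trial indices and split them into train and test index lists."""
    indices = list(range(n_trials))
    random.Random(seed).shuffle(indices)
    n_train = int(round(n_trials * train_fraction))
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_trials()
print(len(train_idx), len(test_idx))   # 230 58
```

With 288 trials, 80% is 230.4, so one of the two subsets necessarily absorbs the rounding.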

Performance Analysis for Subject 3
Table 1 shows the performance measures obtained for Subject 3: the precision, recall, and F1 score of the proposed methods based on the ANN, CNN1, CNN2, merged CNNs, and modified merged CNNs models. We notice that the five models gave very similar results.
We provide the confusion matrices for the proposed methods in Figure 10. The diagonal elements give the number of points for which the predicted label is equal to the true label, while the off-diagonal elements are those that are mislabeled by the classifier. The higher the diagonal values of the confusion matrix, the better, since they indicate many correct predictions.

Performance Analysis for All Subjects
Tables 2-6 give the accuracy values obtained with our proposed methods. From these tables, we can conclude that the proposed CNN1 method achieves the best average classification accuracy, 68.77%, with the Tanh function.
On the other hand, the comparison between the results obtained by the proposed CNN1 and CNN2 models shows that adding LSTM layers to the CNN model does not improve it. In addition, merging the two CNNs without adding LSTM layers gives acceptable results.
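The metrics reported in this section (confusion matrix, precision, recall, F1 score, and accuracy) can be computed as in the following sketch. The labels are made up for illustration and do not correspond to any result in the tables; the function names are likewise illustrative.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for every class."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[r][k] for r in range(n))   # column sum
        actual = sum(cm[k])                           # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


def accuracy(cm):
    """Fraction of samples on the diagonal."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))


# Toy labels for 4 classes (0..3); not taken from the paper's results.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 3]
cm = confusion_matrix(y_true, y_pred, 4)
metrics = per_class_metrics(cm)
print(cm, accuracy(cm), metrics[1])
```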
According to Tables 5 and 6, removing the Max Pooling steps, the LSTM layers, and the final SoftMax classification layer from CNN1 and CNN2, and concatenating the features using a monolayer perceptron, improves the classification of motor imagery tasks. The modified merged CNNs give an improvement in accuracy, but the accuracy achieved by CNN1 was the best overall despite its simplicity.
Comparing the subject-specific precision obtained by our proposed models, we can also conclude that each model gave its best results for a different subject.
A comparison of the classification accuracies of the proposed method based on CNN1 and other state-of-the-art methods is presented in Table 7.
As shown in Table 7, our proposed CNN1 method is better than all the other machine learning methods, with an average classification accuracy of 68.77%, while the NB method is the worst, with an average classification accuracy of 58.20%. The average classification accuracy of FBCSP is 67.21%, which is representative of the results of the BCI competition.
FBCSP [32], applied to the BCI competition, achieved acceptable results with handcrafted characteristics for each subject. However, it is impossible to find such optimal handcrafted characteristics for each new subject; therefore, all methods were evaluated under the same conditions in our experiments. In [33], the authors used the Latent Dirichlet Allocation (LDA) to classify motor imagery. They achieved an accuracy of 63.81%.
In [34], the authors compared their proposed PSO-based Fuzzy Logic System (FLS) with several competing approaches, including Naïve Bayes (NB), an AdaBoostM2 ensemble (Ensemble), and Support Vector Machines (SVM). They applied these machine learning methods using the Matlab functions fitcnb, fitensemble, and fitcecoc, respectively. For the SVM method, they used the one-vs-all encoding scheme with a binary SVM learner using the Gaussian kernel. For the Ensemble method, they selected AdaBoostM2 as the classification algorithm and the decision tree as the learner, with the number of ensemble learning cycles set to 100.
Our proposed CNN1-based method gave the best accuracy, which may be due to the simple preprocessing we applied to the dataset. In addition, extracting frequency characteristics using WPD and spatial characteristics using CSP improved the MI classification accuracy.
Our experimental results first confirmed the feasibility of the CNN1-based approach in the EEG domain, and then its effectiveness compared to state-of-the-art methods. We see that the resulting networks successfully learn important characteristics of MI-EEG signals, thus improving overall performance. However, there are still several difficult issues to resolve:

• In this work, merging the two CNNs and adding the LSTM layers did not improve the classification result. Both points could be addressed, for example, by adding other temporal characteristics or by applying fusion methods other than concatenation and merging through the monolayer perceptron.
• According to Table 7, state-of-the-art methods sometimes surpass our proposed CNN1-based method for a few subjects. Various architectural extensions, such as adding more convolutional layers, could be applied to our proposed CNN1-based method; however, the effects of these extensions in the EEG domain are not clear. Therefore, we will attempt to extend the architecture of the merged CNNs to determine whether architectural changes affect the overall performance of the system.
As discussed in this section, methods based on the fusion of CNNs can be improved in several directions. We will address the aforementioned issues to improve the performance of MI-EEG classification.

Conclusions
In this paper, we proposed five methods for the classification of four-class motor imagery EEG signals using DL. The proposed CNN1 model gave better results than our other proposed methods and than state-of-the-art methods.
The proposed fusion models show an acceptable classification rate for MI-EEG signals; therefore, it would be very interesting to apply the multi-layer CNN fusion models to other EEG datasets. We also wish to study other feature fusion methods in order to improve the performance of our proposed method.

Data Availability Statement:
The dataset used in this study is public and can be found at the following links: BCI Competition IV dataset 2a http://www.bbci.de/competition/iv/, accessed on 30 January 2020.

Conflicts of Interest:
The authors declare no conflict of interest.