Representation Learning for Motor Imagery Recognition with Deep Neural Network

This study describes a method for classifying electrocorticograms (ECoGs) based on motor imagery (MI) in brain–computer interface (BCI) systems. The method differs from the traditional feature extraction and classification approach: it employs a deep learning algorithm for extracting features and a traditional algorithm for classification. Specifically, we use a convolutional neural network (CNN) to extract features from the training data and then classify those features by combining it with the gradient boosting (GB) algorithm. Combining the CNN and GB algorithms helps us to obtain more feature information from brain activities and thereby to classify the imagined body actions. The performance of the proposed framework has been evaluated on dataset I of BCI Competition III. Furthermore, the combination of deep learning and traditional algorithms provides some ideas for future research on BCI systems.


Introduction
Brain-computer interface (BCI) is a state-of-the-art technology serving as a direct communication pathway between a human brain and an external device. BCI systems can provide communication and control capabilities to humans without depending on the brain's normal output pathways of peripheral nerves and muscles. BCI systems translate neuronal activities into user commands, messages, or other signals [1,2].
BCI systems based on sensorimotor rhythms are known as motor imagery (MI) BCI systems [2]. Sensorimotor rhythms include the alpha (8–13 Hz) and beta (14–26 Hz) frequency bands [3,4]. When a human imagines a motor action without any actual movement, the power of alpha and beta rhythms can decrease or increase in the sensorimotor cortices over the contralateral hemisphere and the ipsilateral hemisphere; this phenomenon is called event-related desynchronization/synchronization (ERD/ERS) [5,6]. The imagination of motor tasks can be decoded to user intent by MI-based BCI systems [7]. Figure 1 illustrates the block diagram of BCI systems for MI classification. The complete scheme includes three main stages. In the first stage, the biomedical signal is acquired from the users. Various kinds of brain signals have been used as the basis for interpreting the intentions of users. BCI records brain activities through non-invasive and invasive modalities [8]. The most common types of signals include electrophysiological brain activity acquired over the scalp (electroencephalogram, EEG), electrophysiological brain activity recorded beneath the skull (electrocorticogram, ECoG), and electrophysiological brain activity acquired from within the parenchyma (local field potentials (LFPs) and single-neuron action potentials (single units)) [9]. All of these major modalities for BCI record microvolt-level extracellular potentials generated by neurons in the cortical layers [10]. Non-invasive techniques such as EEG have been widely used in many important BCI systems, including two-dimensional and three-dimensional BCI control [11,12].
Compared with EEG, invasive techniques such as ECoG provide superior signal quality, higher temporal and spatial resolution, broader bandwidth, higher amplitude, better signal-to-noise ratio (SNR), and lower vulnerability to artifacts such as blinks and eye movement [13]. In the second stage, the signal processing procedure converts digitized signals into commands that operate an output device [11,14,15] (e.g., industrial robot arms, wheelchairs, quadcopters). The signal processing stage, which includes feature extraction and feature translation, is the main component of the entire system. In the third stage, brain activities are translated into control signals that drive an output device [16,17]. Brain functional activities associated with cognitive and behavioral events can be analyzed in the signal processing stage to classify different mental tasks and to assess the performance of MI-based BCI systems [1]. How to effectively learn representations of brain activities is a key point of BCI systems. To date, machine learning technology powers many aspects of brain signal analysis. Conventional machine learning techniques are limited in their ability to process natural data in raw form and instead rely on hand-designed features [18]. Traditional brain signal analysis begins with preprocessing, after which hand-crafted feature representations are extracted. Finally, the extracted feature vectors are fed into classifiers to classify different MI tasks. Many individual or combined measures have been applied to brain activity analyses, such as band power, power spectral density, common spatial patterns, wavelet transform, autoregressive models, local binary pattern operators, and nonlinear measures (e.g., approximate entropy, sample entropy, fractal dimension, fractal intercept, and lacunarity) [1,13,18–29].
Because brain functional activities exhibit dynamic, transient, and non-stationary characteristics, the acquired signals contain considerable noise. Hand-crafted features may therefore lose some information during the process of feature extraction.
Deep neural networks allow the system to input features containing raw spatial information, and an appropriate componential structure can be applied to learn distributed representations of data with multiple layers of extraction to make optimal classifications. Therefore, we explore the capabilities of deep-learning methods for modeling cognitive events from brain activities.
Although deep neural networks have attracted enthusiastic interest within large-scale image recognition, video recognition, and natural language processing, they remain relatively unexplored in MI-based BCI systems [30–32]. One of the main reasons is that the number of samples in public MI-based datasets is limited, thus making such data less adequate for training large-scale deep neural networks with millions of parameters [33]. However, the advantages of deep neural networks over traditional brain activity analyses begin to appear when the scale of datasets becomes very large or the dimension of samples becomes very high. Nevertheless, convolutional neural networks (CNN), deep belief networks (DBN), and recurrent neural networks (RNN) have been employed to learn representations from EEG [33–37]. Li et al. (2017) developed a new neuroscience-motivated parametric CNN, which was based on parameterized convolutional filters, to analyze EEG and to understand the underlying features related to the classification. Relevant experimental results showed that the proposed model outperforms conventional CNN architectures and all compared classification methods [34]. A DBN formed by a plurality of restricted Boltzmann machines (RBM) has been used to extract EEG features, and each RBM can be trained greedily and unsupervised. The proposed algorithm achieved a 4–6% accuracy increase compared to other classifiers [35]. Long short-term memory (LSTM) was used to learn features from EEG, and then a dense layer was used for classification to obtain higher average accuracies in comparison with the conventional techniques [36]. CNN and LSTM networks were utilized to extract spatial, spectral, and temporal invariant representations from EEG data. Empirical evaluation of the cognitive load classification task demonstrated a 6.4% accuracy increase over current state-of-the-art approaches [33].
CNN and LSTM networks are employed to extract spatial and temporal patterns from EEG data, and deep forest models are used in conjunction to obtain a stronger classifier [37].
In recent years, CNN has been gradually applied to identify MI tasks in EEG-based BCI systems. The key challenge in correctly identifying MI tasks from acquired brain signals is constructing a model that is sufficiently robust for analyses of signals in time, frequency, and space. Numerous attempts have been made to improve the design of CNN architectures in a bid to achieve better performance. The convolutional neural network, which combines artificial neural networks and deep learning, is a special type of deep neural network. Its connections between neurons take advantage of a local connection architecture and shared weights, so a CNN contains fewer connections and parameters, and the computational complexity of the network can be significantly reduced. CNN, which has a structure similar to biological neural networks, is well suited to the analysis and processing of brain activities. We propose a novel approach to learning representations from ECoG that combines deep learning with a gradient boosting algorithm, aiming at state-of-the-art MI classification.
In this study, we propose an algorithm to learn representations of brain activities associated with MI depending on deep learning and to classify different MI tasks for ECoG-based BCI systems. The remainder of this paper is organized as follows. Section 2 describes the experimental dataset. The methods are introduced in Section 3. Section 4 presents the results. Finally, discussions and conclusions are summarized at the end of this paper.

ECoG Dataset
The experimental data are obtained from the dataset I of the BCI Competition III, which includes one subject with focal epilepsy. This is the only dataset in BCI Competitions for motor imagery based on ECoG recordings. Although the ECoG data were selected from one subject, the subdural electrode arrays were planted within the cortex of the subject suffering from focal epilepsy for one to two weeks. The patient cannot focus on MI for a long time due to needing some days to recover after the implantation surgery. It is impossible to conduct a long-time experiment, and therefore, only a small amount of data could be recorded. During the experiment, the recording structure might experience slight changes concerning electrode positions and impedances. Brain activities exhibit different states concerning motivation or fatigue across time. Thus, the experimental design of MI-based BCI systems is very challenging [38].
The experimental procedure shows that the training and test trials are recorded from two different days with an approximately one-week interval. During the BCI experiment, the patient, facing a monitor, is seated in a bed and is asked to repeatedly perform an imagined movement of either the left small finger or the tongue. The 8 × 8 platinum electrode grid is implanted on the contralateral (right) motor cortex of the patient's right hemisphere to record ECoG data. All ECoG data are recorded with 64 active electrodes. The locations of the primary motor cortex are shown in Figure 2a. Figure 2b depicts the positions of 64 channels. All recording activities are performed with a sampling rate of 1000 Hz. The ECoG dataset consists of a training dataset and a test dataset. The imagination duration starts with a cue that is presented in the form of a picture depicting MI tasks. Each trial is recorded for 3 s, as illustrated in Figure 2c. The recorded duration starts 0.5 s after the visual cue has ended, to avoid visually evoked potentials.

Method
The architecture of the proposed ECoG-based BCI system is summarized in Figure 3. It contains three stages: preprocessing, unsupervised feature extraction, and classification. Different stages of the scheme are described in detail in the following sections.

Preprocessing
The preprocessing procedure, which is crucial for denoising, aims to remove both high-frequency noise and low-frequency activities, to reduce the size of the ECoG data, and to remove artifacts. For this purpose, the ECoG is first downsampled to 100 Hz. Then, the signals are filtered between 0.5 and 30 Hz using a 5th-order digital Butterworth filter. The resulting 0.5–30 Hz signals exhibit the ERD/ERS phenomenon of MI tasks with reduced eye movement and electromyogram artifacts.
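The two preprocessing steps can be sketched as follows. This is a minimal illustration rather than the pipeline actually used in this study: it assumes NumPy and SciPy, the function name `preprocess` is hypothetical, and the use of `decimate` (which applies its own anti-aliasing filter) and of zero-phase filtering with `sosfiltfilt` are assumptions, since the text specifies only the target rate, the pass band, and the filter order.

```python
import numpy as np
from scipy.signal import butter, decimate, sosfiltfilt

FS_RAW = 1000     # recording sampling rate in Hz (from the dataset description)
FS_TARGET = 100   # sampling rate after downsampling in Hz


def preprocess(trials, fs_raw=FS_RAW, fs_target=FS_TARGET,
               band=(0.5, 30.0), order=5):
    """Downsample and band-pass filter ECoG trials.

    trials: array of shape (n_trials, n_channels, n_samples).
    Returns an array of shape (n_trials, n_channels, n_samples // factor).
    """
    factor = fs_raw // fs_target
    # decimate low-pass filters the data before keeping every `factor`-th sample
    # (assumption: the paper does not say how the downsampling was done).
    down = decimate(trials, factor, axis=-1, zero_phase=True)
    # 5th-order Butterworth band-pass design in second-order sections.
    sos = butter(order, band, btype="bandpass", fs=fs_target, output="sos")
    # Forward-backward filtering gives zero phase shift (assumption).
    return sosfiltfilt(sos, down, axis=-1)
```

With 3 s trials recorded at 1000 Hz from 64 electrodes, each trial shrinks from 64 × 3000 to 64 × 300 samples.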

Feature Extraction
This stage aims to extract relevant features. These features contain the time, frequency, and spatial characteristics of the ECoG signals and are suitable for MI tasks. We develop a CNN configuration to address the inherent structure of ECoG data and to obtain an optimal characterization of ECoG recordings from the right hemisphere of the brain, as well as the dynamics of the ERD/ERS phenomenon in the MI state.
CNN, a feed-forward neural network that takes convolution as its core, is one of the most important concepts in deep learning. Mathematically, convolution is an operation in which one function is applied over the output of another. Let f(t) and g(t) be two integrable functions over the field of real numbers. These two functions are integrated to produce a new function, which is called the convolution operation. It can be expressed as follows,

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau,$$

where f(t) and g(t) are the two functions being convolved, τ is the integration variable, t is the amount of displacement of the function g(−τ), and "∗" is the convolution operator. In this way, with different values of t, this integral defines a new function called the convolution of the functions f(t) and g(t).
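For sampled signals the integral becomes a finite sum. The following minimal sketch, assuming NumPy and using the hypothetical name `convolve_1d`, writes that sum out explicitly to show how each output value combines f(τ) with the shifted, reversed g(t − τ); the example arrays are illustrative only.

```python
import numpy as np


def convolve_1d(f, g):
    """Discrete analogue of (f * g)(t) = sum over tau of f(tau) * g(t - tau)."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    out = np.zeros(len(f) + len(g) - 1)
    for t in range(len(out)):
        for tau in range(len(f)):
            # Only terms where g is defined contribute to the sum.
            if 0 <= t - tau < len(g):
                out[t] += f[tau] * g[t - tau]
    return out


signal = [1.0, 2.0, 3.0, 4.0]   # illustrative samples of f
kernel = [1.0, 0.0, -1.0]       # illustrative samples of g
result = convolve_1d(signal, kernel)
```

The result agrees with `np.convolve(signal, kernel)`, which computes the same full convolution.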
A CNN model consists of a series of different layers, including the convolutional layer, the activation layer, the pooling layer, and the fully connected layer [39,40]. The schematic of the whole MI task recognition course is shown in Figure 4.

Convolutional Layer
The most important building block of a CNN is the convolutional layer, which is a linear computing layer that uses a series of convolution kernels to convolve with multichannel input data. The convolution kernel uses a sliding window to perform small-scale weighting operations at various positions of the input ECoG data to obtain ECoG features. Feature maps are obtained from the corresponding processing of the input data.
In the convolutional layer of Caffe (convolutional architecture for fast feature embedding) [41], ECoG can be processed as follows. The ECoG data are given as $X_i \in \mathbb{R}^{C \times T}$, where C is the number of recording channels, T is the number of samples, and i = 1, . . . , N, with N denoting the total number of trials. In our method, the ECoG data first need to be converted into Caffe-readable data with the size $N \times M_l \times C_l \times T_l$, where M is the number of feature maps and l indexes the layers.
The expression of the ECoG signals after the convolution layer is

$$\mathrm{ReLU}\left(W_s^{(r)} * X_i + b\right),$$

where $W_s^{(r)}$ represents the r-th convolution kernel of the s-th layer, $X_i$ represents the input variable of the s-th layer, ReLU is the activation function [42], and b is the bias value of the network. The mathematical expression of ReLU is

$$\mathrm{ReLU}(x) = \max(0, x).$$

We define the size of the convolution kernel as $h_k \times w_k$, where $h_k$ is the height of the kernel and $w_k$ is its width. The interval (stride) with which the filter is applied to the input data is $h_s \times w_s$, where $h_s$ and $w_s$ are the distances in the vertical and horizontal directions, respectively. The data filled (padded) on the boundary of the input data are $h_p \times w_p$, where $h_p$ and $w_p$ stand for the degrees of filling in the vertical and horizontal directions, respectively. The number of output feature maps of the current convolutional layer is $m_1$. For input feature maps of size $C_l \times T_l$, the output of the convolution layer has size

$$N \times m_1 \times \left(\frac{C_l + 2h_p - h_k}{h_s} + 1\right) \times \left(\frac{T_l + 2w_p - w_k}{w_s} + 1\right).$$
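How kernel size, stride, and padding determine the size of the output feature maps can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the floor division follows the usual convention for convolution output sizes, and the 64 × 300 input (64 channels, 3 s at 100 Hz) with a 1 × 5 kernel is an illustrative choice rather than a confirmed layer configuration of the network.

```python
def conv_output_size(h_in, w_in, h_k, w_k, h_s=1, w_s=1, h_p=0, w_p=0):
    """Height and width of a feature map after a convolution layer.

    h_k, w_k: kernel size; h_s, w_s: stride; h_p, w_p: padding.
    """
    h_out = (h_in + 2 * h_p - h_k) // h_s + 1
    w_out = (w_in + 2 * w_p - w_k) // w_s + 1
    return h_out, w_out


def relu(x):
    """ReLU activation: passes positive values, clips negatives to zero."""
    return max(0.0, x)


# Assumed example: 64 x 300 input, 1 x 5 kernel, stride 1, no padding.
print(conv_output_size(64, 300, 1, 5))  # (64, 296)
```

A 1 × w kernel leaves the channel dimension untouched and shortens only the time axis, by w − 1 samples when the stride is 1 and no padding is used.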

Pooling Layer
After the ECoG data pass through the convolution layer, we add one pooling layer, which is a non-linear computing layer. The goal is to subsample the input data to reduce the computational load, memory usage, and the number of parameters (thereby limiting the risk of overfitting). As in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer; the pooling layer aggregates these inputs using an aggregation function such as the max or the mean. In this experiment, we use a 1 × 3 max-pooling kernel, a stride of 1 × 1, and no padding; note that only the maximum input value in each kernel window continues to the next layer, while the other inputs are dropped.
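The 1 × 3 max-pooling with stride 1 and no padding described above can be illustrated on a small feature map. This is a minimal NumPy sketch; the function name `max_pool_1xk` and the toy values are hypothetical.

```python
import numpy as np


def max_pool_1xk(x, k=3, stride=1):
    """Max-pool a 2D feature map along its columns with a 1 x k window, no padding."""
    n_rows, n_cols = x.shape
    n_out = (n_cols - k) // stride + 1
    out = np.empty((n_rows, n_out), dtype=x.dtype)
    for j in range(n_out):
        start = j * stride
        # Keep only the largest value inside each window.
        out[:, j] = x[:, start:start + k].max(axis=1)
    return out


feature_map = np.array([[1, 5, 2, 0, 3],
                        [4, 4, 1, 7, 2]])
pooled = max_pool_1xk(feature_map)
print(pooled)
# [[5 5 3]
#  [4 7 7]]
```

Each row is pooled independently, so the number of rows is preserved while the number of columns drops from 5 to 3.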
In the Caffe architecture, ECoG data are handled in the pooling layer in the same way as in the convolutional layer. The definition and operation of each parameter are the same as in the convolution layer, but the calculation differs: the convolution layer performs a convolution operation within the convolution window, whereas the pooling layer takes the maximum within the pooling window. Specifically, the size of the pooling kernel is $h_k \times w_k$, where $h_k$ and $w_k$ are the height and width of the pooling kernel. The interval (stride) with which the pooling window is applied to the input data is $h_s \times w_s$, where $h_s$ and $w_s$ represent the distances in the vertical and horizontal directions, respectively. The number of output feature maps of the current pooling layer is $m_2$. The input data of the pooling layer come from the convolutional layer. Writing $H \times W$ for the size of each input feature map, the output of the pooling layer has size

$$N \times m_2 \times \left(\frac{H - h_k}{h_s} + 1\right) \times \left(\frac{W - w_k}{w_s} + 1\right).$$

Fully Connected Layer
Essentially, the convolutional layers are providing a meaningful, low-dimensional, and somewhat invariant feature space. While the output from the convolutional layer could be flattened and connected to the output layer, adding a fully connected layer is a (usually) cheap way of learning non-linear combinations of these features. A fully connected layer is a linear computing layer that directly linearly transforms the input data, and we can divide the function of the fully connected layer into two parts. One is a feature extraction layer, while the other part is the final classification layer.
The fully connected layer flattens multidimensional feature maps into a single feature vector. Each neuron in the layer is connected to all neurons in the previous layer. In the Caffe architecture, the final output of the ECoG data has size $N \times M_f \times 1 \times 1$: after the data pass through the fully connected layer, the output for each trial is a single vector and each feature map has size 1 × 1, where $m_3$ is the number of output feature maps of the current fully connected layer.

Classification
In the CNN model, the weights cannot be adequately trained on a small dataset. When the last fully connected layer is used as the classifier on the test dataset, the CNN model attains an accuracy of 89%. To make full use of the data features from the CNN model, we used gradient boosting (GB) [43] as the final classifier. GB is a machine learning technique for classification problems. It generates a prediction model in the form of an ensemble of weak predictive models. Similar to other boosting methods, such as AdaBoost and LogitBoost [44], it builds the model in a stage-wise fashion and allows optimization of any differentiable loss function.
Let $o_i$ denote the input feature vector of the i-th trial and $y_i$ its class label, for i = 1, . . . , N, where N denotes the total number of trials. The initial value of the classifier is $F_0 = 0$, and the classifier $F_Z$ is obtained by updating it continuously over Z iterations. The class probability follows the logarithmic regression model of Formula (8), whose initial value is accordingly $p_0(y_i = 1 \mid o_i) = 0.5$. Ordinary least squares (OLS) regression is used to minimize the loss function. For z = 1, . . . , Z, the GB algorithm based on OLS regression can be expressed as follows:
(1) Calculate the gradient of the loss function along the direction of gradient descent.
(2) Use OLS to select the weak classifier $J_z$ that best fits this gradient.
(3) Calculate the weight of the weak classifier.
(4) To improve the generalization performance of the algorithm, shrink $J_z$ by multiplying it by a small ε at each step; a strong classifier is obtained by iteration.
(5) Obtain the new logarithmic regression value according to Formula (8).
Finally, the training and test data are input into the GB network to derive the classification accuracy. The performance of the proposed method is evaluated according to (13).
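The five steps can be sketched in NumPy as shown below. This is an illustrative sketch, not the implementation used in this study, and it rests on several assumptions: labels are coded as 0/1, the probability model is the standard logistic function of F, the weak learners are regression stumps fitted by least squares, the weight of each weak learner comes from a single Newton step, and the synthetic two-class data merely stand in for CNN features. All function names are hypothetical.

```python
import numpy as np


def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))


def fit_stump(X, r):
    """Least-squares regression stump fitted to the targets r."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:   # drop max so both sides are non-empty
            left = X[:, j] <= thr
            lv, rv = r[left].mean(), r[~left].mean()
            err = np.sum((r - np.where(left, lv, rv)) ** 2)
            if err < best_err:
                best_err, best = err, (j, thr, lv, rv)
    return best


def stump_predict(stump, X):
    j, thr, lv, rv = stump
    return np.where(X[:, j] <= thr, lv, rv)


def fit_gb(X, y, n_iter=50, eps=0.1):
    F = np.zeros(len(y))                      # F_0 = 0, hence p_0 = 0.5
    model = []
    for _ in range(n_iter):
        p = sigmoid(F)
        r = y - p                             # step 1: negative gradient of the log-loss
        stump = fit_stump(X, r)               # step 2: least-squares weak learner J_z
        if stump is None:
            break
        J = stump_predict(stump, X)
        # step 3: weight of the weak learner (one Newton step, an assumption)
        gamma = np.sum(r * J) / (np.sum(p * (1.0 - p) * J ** 2) + 1e-12)
        F += eps * gamma * J                  # step 4: shrunken update with small eps
        model.append((gamma, stump))          # step 5: p is recomputed at the next pass
    return model


def predict_proba(model, X, eps=0.1):
    F = np.zeros(X.shape[0])
    for gamma, stump in model:
        F += eps * gamma * stump_predict(stump, X)
    return sigmoid(F)


# Synthetic stand-in for CNN features: two classes separated along feature 0.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(0, 1, (100, 2)) + [4.0, 0.0]])
y = np.concatenate([np.zeros(100), np.ones(100)])
model = fit_gb(X, y)
train_acc = np.mean((predict_proba(model, X) > 0.5) == y)
```

The shrinkage factor ε trades convergence speed for generalization: smaller values need more iterations but make each weak learner's contribution less decisive.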

$$\mathrm{Accuracy} = \frac{\text{The correct number of trials}}{\text{The total number of trials}} \times 100\% \qquad (13)$$

Additionally, we further measure the performance of the deep representation by introducing the information transfer rate (ITR) [45], which incorporates accuracy and speed in a single value. It is calculated by

$$\mathrm{ITR} = \log_2 N + P \log_2 P + (1 - P) \log_2 \frac{1 - P}{N - 1},$$

where N is the class number and P is the classification accuracy.
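Both measures are simple to compute. The sketch below assumes the commonly used definition of ITR in bits per trial, log2 N + P log2 P + (1 − P) log2((1 − P)/(N − 1)); the function names are hypothetical, and the trial counts in the example are illustrative values chosen to give 92%, not counts taken from the dataset.

```python
import math


def accuracy(n_correct, n_total):
    """Classification accuracy in percent, as in Equation (13)."""
    return 100.0 * n_correct / n_total


def itr_bits_per_trial(n_classes, p):
    """Information transfer rate in bits per trial (assumed standard definition).

    n_classes: number of classes N; p: classification accuracy in [0, 1].
    """
    bits = math.log2(n_classes)
    if 0.0 < p < 1.0:
        bits += p * math.log2(p)
        bits += (1.0 - p) * math.log2((1.0 - p) / (n_classes - 1))
    elif p == 0.0:
        # The P*log2(P) term vanishes in the limit P -> 0.
        bits += math.log2(1.0 / (n_classes - 1))
    return bits


acc = accuracy(92, 100)                      # illustrative counts only
itr = itr_bits_per_trial(2, acc / 100.0)     # two MI classes
```

For two classes, chance-level accuracy (P = 0.5) yields zero bits per trial and perfect accuracy yields one bit per trial, so the ITR grows faster than linearly as accuracy approaches 100%.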

Parameter Settings
In this experiment, the complete network can be divided into two parts. Feature extraction: The convolutional layers are serving the purpose of feature extraction. The CNN model captures the enhanced representation of data; hence, there is no need for feature engineering.
Classification: After feature extraction, we must classify the data into various classes, and this can be performed by using a fully connected neural network. In place of fully connected layers, we can also use a conventional classifier such as GB, k-nearest neighbor (KNN), Bayesian linear discriminant analysis (BLDA), support vector machines (SVM), or random forest (RF). In this paper, we adopt GB to execute the classification procedure. For a CNN model, the number of convolution layers and the size of the convolution kernel are important factors that affect the performance of the network. These parameters directly determine the quality of feature extraction. Starting from the classical LeNet-5 framework, we adjusted its parameter settings to extract features from the ECoG signals. LeNet-5 consists of three convolution layers, two pooling layers, and two fully connected layers. Accordingly, we count the total number of layers of a network as the number of convolutional layers plus the number of fully connected layers, and there are only two fully connected layers in each experiment group.
We varied two factors to extract features: the total number of network layers (3, 4, 5, 6, and 7) and the convolution kernel size (1 × 3, 1 × 5, 1 × 7, and 1 × 9). For each configuration, we also calculated the classification accuracy with a fully connected classifier. Table 1 lists the specific experimental results. As can be seen from Table 1, the CNN model classification accuracy reached 89% in three configurations: six total layers with a 1 × 3 convolution kernel, six total layers with a 1 × 5 convolution kernel, and five total layers with a 1 × 7 convolution kernel. The feature data from these three CNN models were then placed into the GB classifier for comparison, and the second configuration gave the highest accuracy (92%). We therefore chose the network structure of six layers with a convolution kernel size of 1 × 5 as the final CNN model to extract data features. Specifically, the whole CNN network and data processing course in Caffe are shown in Figure 5.
Figure 5. The whole CNN network structure and data processing course in Caffe.

The CNN Features Visualization
The extracted CNN features further enhance the strength of ECoG signals, as shown in Figure 6. We create spectrograms from the raw ECoG signals and from the CNN features, respectively. Figure 6a shows the raw signal averaged over all samples of the same kind of MI task. Figure 6b visualizes the ECoG signal strength, and Figure 6c shows the strength of the CNN features. It is worth noting that Figure 6b,c are also computed from the average of all samples of the same kind of MI task. Ordinarily, ECoG is a low-frequency signal, and the strong ECoG components are distributed at low frequencies, as shown in Figure 6b. This characteristic makes the ECoG vulnerable to disturbance by external factors during the processing procedure. After the CNN network, the distribution of deep representation strength is wider (Figure 6c), which reflects the effectiveness of the CNN network. Furthermore, the two MI tasks (left pinky and tongue) are more readily identifiable in Figure 6c than in Figure 6b.

The Comparison of Experimental Results
In this paper, the CNN works as a trainable feature extractor and GB serves as the recognizer. This hybrid model automatically extracts features from the raw ECoG data and generates the predictions. The final classification accuracy reaches 92% with our proposed model, as shown in Figure 7a. The classifiers compared include GB, Bayesian linear discriminant analysis (BLDA), KNN, and SVM. The accuracies of the different algorithms vary from 89% to 92%, and the deep representation with the GB classifier achieves the best performance. Furthermore, Figure 7 also shows that the deep representation obtains a higher ITR.
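To make the link between accuracy and ITR concrete, the following minimal sketch evaluates the commonly used Wolpaw formula, B = log2 N + P log2 P + (1 - P) log2((1 - P)/(N - 1)), in bits per trial. The choice of this formula is an assumption and may differ from the exact ITR definition behind Figure 7; the function name is hypothetical, and N = 2 reflects the two MI tasks (left pinky and tongue).

```python
import math

def wolpaw_itr_bits_per_trial(n_classes, accuracy):
    """Wolpaw information transfer rate in bits per trial (assumed definition)."""
    if n_classes < 2 or not 0.0 <= accuracy <= 1.0:
        raise ValueError("need n_classes >= 2 and 0 <= accuracy <= 1")
    bits = math.log2(n_classes)
    # The p*log2(p) terms tend to 0 as p -> 0, so they are skipped at the limits
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1))
    return bits

# Accuracy range reported for the compared classifiers (89% to 92%)
for acc in (0.89, 0.92):
    itr = wolpaw_itr_bits_per_trial(2, acc)
    print(f"accuracy={acc:.2f}  ITR={itr:.3f} bits/trial")
```

Dividing the result by the trial duration in minutes converts it to bits per minute.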
Finally, the algorithm proposed in this paper is compared with other methods. The competition winner obtained an accuracy of 91% by combining features including band power, common spatial subspace decomposition (CSSD), and waveform mean [46]. Using the common spatial pattern (CSP) as a trainable feature extractor with an SVM classifier yields an accuracy of 84% [47]; the goal of the CSP algorithm is to find a set of optimal spatial filters for projection and thereby obtain a higher-resolution eigenvector. Extracting features with the wavelet transform (WT) and classifying them with a probabilistic neural network (PNN) gives an accuracy of 88% [48]. When the power features are extracted by relative wavelet energy (RWE), the PNN classifier reaches an accuracy of 91.8% [49]. Xu F. et al. (2014) developed a modified S-transform (MST) algorithm, an improved method based on the S-transform, and achieved 92% classification accuracy by combining MST feature extraction with the GB classifier [50]. The results are shown in Table 2. The classification accuracy of the method proposed in this paper is at least as high as that of the other references. At the same classification accuracy, the computational complexity of the MST is higher than that of the CNN. Moreover, the CNN can be accelerated with GPUs, which reduces the running time.
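For readers unfamiliar with the CSP baseline mentioned above, the following sketch shows the textbook two-class formulation: whiten the summed class covariances, then diagonalize one class so that the extreme eigenvectors give spatial filters with maximally different variance between classes. It illustrates the general technique only, not the implementation in [47]; the function names, channel count, and synthetic trials are assumptions.

```python
import numpy as np

def csp_filters(trials_a, trials_b, n_pairs=1):
    """Two-class CSP. trials_*: (n_trials, n_channels, n_samples).

    Returns an array of shape (2 * n_pairs, n_channels); rows are filters.
    """
    def mean_cov(trials):
        # Trace-normalized spatial covariance, averaged over trials
        covs = [x @ x.T / np.trace(x @ x.T) for x in trials]
        return np.mean(covs, axis=0)

    ca, cb = mean_cov(trials_a), mean_cov(trials_b)
    # Whitening transform for the composite covariance ca + cb
    evals, evecs = np.linalg.eigh(ca + cb)
    whitener = evecs @ np.diag(evals ** -0.5) @ evecs.T
    # Eigen-decomposition of whitened class-A covariance (ascending eigenvalues)
    lam, u = np.linalg.eigh(whitener @ ca @ whitener.T)
    filters = u.T @ whitener
    # Keep the filters at both ends of the spectrum
    pick = np.r_[np.arange(n_pairs), np.arange(len(lam) - n_pairs, len(lam))]
    return filters[pick]

def log_variance_features(trials, filters):
    """Normalized log-variance of each trial after spatial filtering."""
    projected = np.einsum("fc,tcs->tfs", filters, trials)
    var = projected.var(axis=2)
    return np.log(var / var.sum(axis=1, keepdims=True))

# Synthetic data (assumed): 4 channels, class A strong on channel 0,
# class B strong on channel 1.
rng = np.random.default_rng(0)
scale_a = np.array([3.0, 1.0, 1.0, 1.0])[None, :, None]
scale_b = np.array([1.0, 3.0, 1.0, 1.0])[None, :, None]
trials_a = rng.standard_normal((30, 4, 500)) * scale_a
trials_b = rng.standard_normal((30, 4, 500)) * scale_b

w = csp_filters(trials_a, trials_b)
feat_a = log_variance_features(trials_a, w)
feat_b = log_variance_features(trials_b, w)
```

The resulting two-dimensional log-variance features are what a classifier such as an SVM would receive in a CSP-based pipeline.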

Conclusions
A novel deep representation method that exploits the inherent characteristics of MI-based ECoG is introduced in this paper. The CNN algorithm is used to learn a representation from ECoG signals, and the deep representation is then fed into the traditional GB classifier. The better classification accuracy and higher ITR demonstrate the effectiveness of the proposed combined algorithm. Additionally, we show the performance of the system under different CNN network structures. By relying on Caffe and GPUs, this system can perform high-speed, real-time computation.