Data Augmentation for Motor Imagery Signal Classification Based on a Hybrid Neural Network

As an important paradigm of spontaneous brain-computer interfaces (BCIs), motor imagery (MI) has been widely used in the fields of neurological rehabilitation and robot control. Recently, researchers have proposed various methods for feature extraction and classification based on MI signals. The decoding model based on deep neural networks (DNNs) has attracted significant attention in the field of MI signal processing. Due to the strict requirements for subjects and experimental environments, it is difficult to collect large-scale and high-quality electroencephalogram (EEG) data. However, the performance of a deep learning model depends directly on the size of the datasets. Therefore, the decoding of MI-EEG signals based on a DNN has proven highly challenging in practice. Based on this, we investigated the performance of different data augmentation (DA) methods for the classification of MI data using a DNN. First, we transformed the time series signals into spectrogram images using a short-time Fourier transform (STFT). Then, we evaluated and compared the performance of different DA methods for this spectrogram data. Next, we developed a convolutional neural network (CNN) to classify the MI signals and compared the classification performance before and after DA. The Fréchet inception distance (FID) was used to evaluate the quality of the generated data (GD), and the classification accuracy and mean kappa values were used to explore the best CNN-DA method. In addition, analysis of variance (ANOVA) and paired t-tests were used to assess the significance of the results. The results showed that the deep convolutional generative adversarial network (DCGAN) provided better augmentation performance than traditional DA methods: geometric transformation (GT), autoencoder (AE), and variational autoencoder (VAE) (p < 0.01). Public datasets of the BCI competition IV (datasets 1 and 2b) were used to verify the classification performance.
Improvements in the classification accuracies of 17% and 21% (p < 0.01) were observed after DA for the two datasets. In addition, the hybrid network CNN-DCGAN outperformed the other classification methods, with average kappa values of 0.564 and 0.677 for the two datasets.


Introduction
A brain-computer interface (BCI) is a communication method between a user and a computer that does not rely on the normal neural pathways of the brain and muscles [1]. Electroencephalogram (EEG) signals are widely used as a BCI input because the method is non-invasive, cheap, and convenient. The generation of EEG signals can be divided into two types; one is active induction, such as motor imagery (MI).

Datasets
We selected two datasets [42] for MI classification to validate our methods. First, we chose the BCI competition IV data set 1 as the training and test data set. This data set was provided by the BCI Research Institute in Berlin and contained two parts: the standard set and the evaluation set. The data of the four subjects (b, d, e, and g) were used for the analysis. The experimental process is shown in Figure 1. The sampling frequency of this experiment was 100 Hz, and each subject underwent 200 trials, resulting in 800 trials for the four subjects as the training and test data. We used EEG signals from three channels (C3, Cz, and C4). The second dataset included the data from nine subjects from the BCI competition IV data set 2b. Three channels (C3, Cz, and C4) were used to record the EEG signals using a 250 Hz sampling rate. Each subject underwent 120 trials in 1-2 sessions and 160 trials in 3-5 sessions. We used five sessions for 720 × 9 trials for all subjects. The experimental process is shown in Figure 2.
The number of trials in each subject class was the same for both datasets. We filtered the 8-30 Hz signals using a Butterworth filter before analysis.
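The 8-30 Hz band-pass step described above can be sketched as follows; the filter order (4) and the use of zero-phase filtering are assumptions, since the paper does not state them.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_8_30(eeg, fs):
    """Butterworth band-pass filter for the 8-30 Hz MI band.

    eeg: array of shape (n_channels, n_samples); fs: sampling rate in Hz.
    Order 4 and zero-phase filtfilt are illustrative assumptions.
    """
    b, a = butter(4, [8.0, 30.0], btype="bandpass", fs=fs)
    return filtfilt(b, a, eeg, axis=-1)

# Example: a 10 Hz sine (inside the band) survives, a 2 Hz sine is attenuated.
fs = 250                      # dataset 2b sampling rate
t = np.arange(0, 4, 1 / fs)   # one 4 s trial
inside = np.sin(2 * np.pi * 10 * t)
outside = np.sin(2 * np.pi * 2 * t)
filt_in = bandpass_8_30(inside[None, :], fs)[0]
filt_out = bandpass_8_30(outside[None, :], fs)[0]
```

A sine inside the pass band keeps most of its energy, while a 2 Hz component is strongly suppressed, which is the intended effect of restricting the analysis to the motion-related band.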

Preprocessing of the Raw Data
MI can cause ERD in the contralateral motor cortex and ERS in the ipsilateral cortex; these phenomena are reflected in changes in the energy of different frequency bands [43]. However, time-series signals cannot describe the features of these conditions. One promising method is a time-frequency transform, which expands the signal in two dimensions. A short-time Fourier transform (STFT) [44] is commonly used, in which a time-frequency localized window function is used for the transformation. The energy characteristics can be detected using a sliding window function that transforms the signals [45] because C3, C4, and Cz represent the dynamical change in the EEG of the MI [46]. Therefore, these three channels were used for the analysis.
As shown in Figure 3, the three channels were converted into a two-dimensional form and were mosaicked into an image using vertical stacking. For each image, the color depth indicates the signal energy of the different bands, the color change trend in the x-axis direction represents the time series, and the color change trend in the y-axis direction reflects the characteristics of the different frequency bands. STFT was applied to the time series for 4 s trials (during the imagery period), with window sizes equal to 128 and 256 for the two datasets, respectively. Due to the difference in sampling rate, the sample sizes of the two datasets were 400 and 1000. Meanwhile, the frequency bands between 8 and 30 Hz were considered to represent motion-related bands. The process was repeated for the three electrodes C3, Cz, and C4. The results were vertically stacked in a way that the channels' neighboring information was preserved. Finally, all spectrogram images were resized to 64 × 64 after the transformation for convenience and consistency in the subsequent calculations.
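The STFT-and-stack pipeline above can be sketched as follows. The overlap setting and the nearest-neighbour resize are assumptions; the paper only specifies the window sizes, the 8-30 Hz band, the vertical stacking, and the final 64 × 64 size.

```python
import numpy as np
from scipy.signal import stft

def mi_spectrogram_image(trial, fs, nperseg):
    """Build a stacked MI spectrogram image: one STFT per electrode,
    keep the 8-30 Hz band, stack C3/Cz/C4 vertically, resize to 64x64.

    trial: array (3, n_samples) holding the C3, Cz, C4 time series.
    Overlap and the crude resize are illustrative assumptions.
    """
    rows = []
    for ch in trial:                            # one STFT per electrode
        f, t, z = stft(ch, fs=fs, nperseg=nperseg)
        band = (f >= 8) & (f <= 30)             # motion-related band only
        rows.append(np.abs(z[band]))            # magnitude spectrogram
    img = np.vstack(rows)                       # preserve channel adjacency
    # nearest-neighbour resize to 64x64
    r = np.arange(64) * img.shape[0] // 64
    c = np.arange(64) * img.shape[1] // 64
    return img[r][:, c]

rng = np.random.default_rng(0)
# dataset 2b style trial: 250 Hz * 4 s = 1000 samples, window 256
img = mi_spectrogram_image(rng.standard_normal((3, 1000)), fs=250, nperseg=256)
```

The output is a single 64 × 64 non-negative image per trial, matching the input size used by the downstream networks.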

Different Data Augmentation Models
DA has been demonstrated to improve the performance of pattern recognition models in the computer vision field [47]. DA increases the complexity of the training model and reduces overfitting by adding artificial data. In this study, we compared the performance of different DA methods for MI classification using a DNN. In the following section, we briefly introduce the different DA methods used in our research.

Geometric Transformation (GT)
GT is an effective method that changes the geometry of the data. The method preserves the characteristics of the data and increases the diversity of the representation [48]. As shown in Figure 4, we used three GT methods for the DA of the MI signals: (1) Rotate the image 180° right or left on the x-axis (rotation); (2) Shift the images left, right, up, or down; the remaining space is filled with random noise (translation); (3) Perform augmentations in the color space (color-space transformation).
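The three transformations can be sketched in numpy as below. The shift size and the intensity-scaling range are assumptions, not values from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

def gt_augment(img, mode):
    """Geometric-transformation augmentations sketched from the paper's
    description; exact parameters are illustrative assumptions.

    img: 2-D spectrogram array.
    """
    if mode == "rotation":           # 180 degree flip about the x-axis
        return np.flipud(img)
    if mode == "translation":        # shift right by 8 px, fill with noise
        shift = 8
        out = np.empty_like(img)
        out[:, shift:] = img[:, :-shift]
        out[:, :shift] = rng.uniform(img.min(), img.max(),
                                     (img.shape[0], shift))
        return out
    if mode == "color":              # simple colour-space (intensity) jitter
        return img * rng.uniform(0.8, 1.2)
    raise ValueError(mode)

base = rng.random((64, 64))
aug = [gt_augment(base, m) for m in ("rotation", "translation", "color")]
```

Each variant keeps the 64 × 64 spectrogram layout, so augmented and real images can be mixed freely in one training set.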


Noise Addition (NA)
NA refers to the addition of random values to the raw data. Francisco et al. [49] demonstrated that NA significantly improves the performance and robustness of a model. A standard random uniform noise procedure was implemented to augment the raw data. The calculation is shown in the following equation: x′ = x + random(−0.5, 0.5) × noise.
In our study, we randomly added Gaussian noise to the MI spectrogram data ( Figure 5).
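The equation above can be sketched directly; the noise scale is an assumed parameter, and the equation itself draws uniform values even though the text also mentions Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x, noise_level=0.1):
    """Noise addition following x' = x + random(-0.5, 0.5) * noise.

    noise_level is an assumed scale, not stated in the paper; the draw is
    uniform, matching the equation in the text.
    """
    return x + rng.uniform(-0.5, 0.5, size=x.shape) * noise_level

img = rng.random((64, 64))
noisy = add_noise(img)
```

With noise_level = 0.1, every pixel is perturbed by at most 0.05 in absolute value, so the augmented image stays close to the original spectrogram.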

Generative Model
Generative models produce artificial data with features similar to those of the raw data; these models have a powerful feature mapping ability and provide a good representation of the original data. In this study, we evaluated the performance of three different generative models.

a. Autoencoder (AE)
A useful strategy for generative modeling involves an autoencoder (AE). As shown in Figure 6, an AE is a feed-forward neural network that is used for data dimensionality reduction, feature extraction, and model generation. The network contains two parts: the encoder z = f(x) is used to compress the input data, and the decoder r = g(z) restores the data that contains useful features.

b. Variational Autoencoder (VAE)

Variational autoencoders (VAEs) and AEs have a similar structure, but VAEs include constraints on the encoder to ensure that the output of the AE has a particular distribution and good robustness. A VAE can be defined as a directed model that uses learned approximate inference [50]. To generate new data using a VAE, an encoder is used to obtain the hidden variable z, and the decoder then generates new data x. During training, the hidden variable learns the probability distribution from the input. In this study, we used the AE (Figure 6) and VAE (Figure 7) models described in Ref. [51].
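The encoder-decoder pair z = f(x), r = g(z) can be illustrated with a minimal numpy autoencoder trained by gradient descent on a mean-squared reconstruction loss. The dimensions, data, and learning rate are illustrative assumptions, far smaller than the image-sized networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy single-hidden-layer autoencoder: encoder z = f(x), decoder r = g(z).
d_in, d_hid = 16, 4
W_e = rng.standard_normal((d_in, d_hid)) * 0.1   # encoder weights
W_d = rng.standard_normal((d_hid, d_in)) * 0.1   # decoder weights

X = rng.standard_normal((64, d_in))              # a batch of toy "signals"
losses = []
for _ in range(300):                             # gradient descent on MSE
    Z = np.maximum(X @ W_e, 0.0)                 # encoder with ReLU: z = f(x)
    R = Z @ W_d                                  # linear decoder: r = g(z)
    err = R - X
    losses.append(float(np.mean(err ** 2)))
    gW_d = Z.T @ err / len(X)                    # gradients of the MSE loss
    gZ = (err @ W_d.T) * (Z > 0)                 # back-prop through ReLU
    gW_e = X.T @ gZ / len(X)
    W_d -= 0.1 * gW_d
    W_e -= 0.1 * gW_e
```

The hidden layer is narrower than the input, so the network is forced to learn a compressed representation; the reconstruction loss falls as training proceeds.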

c. Deep Convolutional Generative Adversarial Networks (DCGANs)
Another type of generative model for DA is a GAN. Goodfellow et al. originally proposed the GAN for data generation and conducted qualitative and quantitative evaluations of the GAN model by comparing it with deep learning networks and overlapping self-encoders [52]. A GAN uses the competition between two networks to achieve a dynamic balance and learn the statistical distribution of the target data. The generator first draws a random noise vector z ~ P_z and learns the distribution P_x of the target data X by fitting a differentiable function G(z; θ_G). The discriminator uses a differentiable function approximator D(·) to predict whether an input comes from the actual target data distribution P_x rather than from the generator. The optimization goal of the framework is to minimize the error between the predicted labels of the generated samples and the real sample labels. The generator is trained to minimize log(1 − D(G(z; θ_G))). Hence, the optimization problem of the GAN can be defined as:

min_G max_D V(D, G) = E_{x~P_x}[log D(x)] + E_{z~P_z}[log(1 − D(G(z; θ_G)))],

where V represents the value function and E represents the expected value; x is the RD, z is the random noise vector, and P(·) is the distribution. The discriminator aims to distinguish whether the generated data are real or not. Thus, cross-entropy is adopted as the loss for this binary classification:

L_D = −E_{x~P_x}[log D(x)] − E_{z~P_z}[log(1 − D(G(z; θ_G)))].

During the training of GANs, the objective is to find the Nash equilibrium of a non-convex game with continuous, high-dimensional parameters. GANs are typically trained using gradient descent techniques to determine the minimum value of a cost function. The GAN learns the feature representation without requiring an explicit cost function, but this may result in instability during training, which often generates meaningless output [53]. To address this problem, many researchers have proposed improved variants.
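The two losses above can be sketched numerically: the discriminator's cross-entropy and the generator's term of the minimax game, evaluated on discriminator outputs D(x) and D(G(z)).

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator cross-entropy: -E[log D(x)] - E[log(1 - D(G(z)))]."""
    eps = 1e-12                      # guard against log(0)
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1 - d_fake + eps))

def g_loss(d_fake):
    """Generator objective from the minimax game: E[log(1 - D(G(z)))],
    which the generator minimises."""
    eps = 1e-12
    return np.mean(np.log(1 - d_fake + eps))

# A confident, correct discriminator has low loss ...
good = d_loss(np.array([0.9, 0.95]), np.array([0.05, 0.1]))
# ... while an undecided one (D = 0.5 everywhere) sits at 2*log 2,
# the value reached at the equilibrium of the game.
undecided = d_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

At the Nash equilibrium the discriminator cannot tell real from generated data, so D outputs 0.5 and its loss settles at 2 log 2.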
In the field of image processing, the DCGAN was proposed [54], and the authors focused on the topology of the DCGAN to ensure stability during training. The discriminator creates filters based on the CNN learning process and ensures that the filters learn useful features of the target image. The generator determines the feature quality of the generated image to ensure the diversity of the generated samples. Since the DCGAN shows excellent performance for image features in hidden space [55], we chose the DCGAN to generate the EEG images. The DCGAN differs from the GAN in the following aspects of the model structure:

1. The pooling layer is replaced by fractional-strided convolutions in the generator and by strided convolutions in the discriminator.
2. Batch normalization is used in both the generator and the discriminator, and there is no fully connected layer.
3. In the generator, all layers except the output use the rectified linear unit (ReLU) as the activation function; the output layer uses tanh.
4. In the discriminator, all layers use the leaky ReLU as the activation function.
In this study, we referred to the DCGAN structure in Cubuk et al. [48] and implemented it as a baseline; the generator and discriminator networks were extended to capture more relevant features from the MI-EEG datasets. The details of the network structure are described in the following.

Generator Model
Due to the weak and non-stationary nature of the features, the generator must produce high-precision data. To guarantee the performance of DA, the generator model should maintain a balanced condition between the discriminator and the generator. As shown in Figure 8, a six-layer network was proposed in our study. A three-channel RGB spectrogram MI image was generated from a random vector by the generator. The up-sampling and convolution operations guaranteed that the output was consistent with the original training dataset. The number of channels of each deconvolution layer was halved, and the spatial size of the output tensor was doubled. Finally, the generated image was output by the tanh activation layer. Details of the generator are summarized in Table 2.
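The halve-channels/double-size progression can be checked with the transposed-convolution size formula out = (in − 1)·stride − 2·pad + kernel. The kernel-4/stride-2/padding-1 setting and the 512-channel 4 × 4 starting tensor are common DCGAN assumptions, not values taken from Table 2.

```python
# Shape walk through an assumed DCGAN-style generator: each fractional-strided
# (transposed) convolution doubles the spatial size and halves the channels.

def deconv_out(size, kernel=4, stride=2, pad=1):
    """Output spatial size of a transposed convolution."""
    return (size - 1) * stride - 2 * pad + kernel

shapes = [(512, 4)]                  # assumed start: 512 channels, 4x4 map
while shapes[-1][1] < 64:            # grow to the 64x64 spectrogram size
    ch, s = shapes[-1]
    shapes.append((ch // 2, deconv_out(s)))
```

Four such layers take a 4 × 4 latent map to the 64 × 64 output, with channels shrinking 512 → 256 → 128 → 64 → 32 along the way.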

Discriminator Model
As shown in Figure 9, the discriminator network consisted of a deep convolution network that aimed to distinguish whether the generated image came from the training data or the generator. Details of the discriminator are summarized in Table 3.
Figure 9. The structure of the discriminator.

"Adam" was used as the optimizer with the following parameters: learning rate = 2 × 10−4, batch size = 128, and training epochs = 20. For every subject in the two datasets, we used 10-fold cross-validation to divide the data and train the network. The network structure of the DCGAN is shown in Figure 10.
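The 10-fold split used for every subject can be sketched as follows; a library helper such as sklearn's KFold would do the same job, but this stays dependency-free. The shuffle seed is an arbitrary assumption.

```python
import numpy as np

def ten_fold_indices(n_trials, seed=0):
    """Plain 10-fold cross-validation: shuffle the trial indices once,
    then yield (train, test) index pairs covering every trial exactly once
    as test data."""
    idx = np.random.default_rng(seed).permutation(n_trials)
    folds = np.array_split(idx, 10)
    for k in range(10):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        yield train, test

# Dataset 1: 200 trials per subject -> 180 train / 20 test in every fold.
splits = list(ten_fold_indices(200))
```

Each trial appears in exactly one test fold, so averaging accuracy over the 10 folds uses every trial once for evaluation.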

Performance Verification of the Data Augmentation
It is well known that the clarity and diversity of the GD are important evaluation indicators. Researchers have conducted a systematic review of the quality evaluation of GD [56]. For image data, visualization is a reliable method because problems can be easily detected in the GD. However, this method does not provide quantitative indicators of the quality of the GD. The inception score is a commonly used quantitative index of the quality of GD. This method assesses the accuracy of the GD using an inception network. The FID is an improved version of the inception score and includes the probability distribution and a similarity measure between the GD and RD [53]. In this method, the features of the data are extracted using the inception network [57], and a Gaussian model is used to conduct spatial modeling of the features. The FID is calculated from the mean values and covariances of the Gaussian models:

FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}),

where r represents the RD, g represents the GD, and Tr is the trace of the matrix. A small FID value indicates a high similarity between the GD and RD and a good DA performance. We compared the augmentation performance of the DCGAN with those of the GT, NA, and the other generative models.
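The FID formula can be sketched directly on Gaussian statistics. The matrix square root is taken via eigendecomposition here; in a real evaluation the means and covariances come from Inception-network features, not raw pixels.

```python
import numpy as np

def fid(mu_r, cov_r, mu_g, cov_g):
    """Frechet inception distance between two Gaussians:
    FID = ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 (cov_r cov_g)^{1/2})."""
    diff = mu_r - mu_g
    prod = cov_r @ cov_g
    w, v = np.linalg.eig(prod)                 # prod need not be symmetric
    covmean = (v * np.sqrt(np.abs(w))) @ np.linalg.inv(v)
    return float(diff @ diff
                 + np.trace(cov_r + cov_g - 2 * covmean.real))

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 5))          # toy 5-D "feature" set
mu, cov = feats.mean(0), np.cov(feats, rowvar=False)
identical = fid(mu, cov, mu, cov)              # ~0 for identical statistics
shifted = fid(mu, cov, mu + 1.0, cov)          # grows with the mean shift
```

Identical statistics give an FID near zero, and shifting every feature mean by 1 adds exactly the squared distance (5 here), matching the interpretation that smaller FID means GD closer to RD.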

Evaluation of the MI Classification Performance after the Augmentation
A good DA performance is expected to improve the performance of the classifier, especially for classification models based on a DNN, which are sensitive to the size of the dataset. CNNs are often used in image classification tasks and achieve good performance, and they often outperform traditional methods for the processing of EEG signals [58][59][60].
A CNN is a multi-layered neural network consisting of a sequence of convolution, pooling, and fully connected layers. Each neuron is connected to the previous feature map by the convolution kernel. The convolution layer extracts the features of the input image according to the kernel size, and the pooling layer is located between the continuous convolution layers to compress the data and parameters and reduce overfitting. More advanced features can be extracted with a larger number of layers. The fully connected layer transforms the output matrix from the last layer into an n-dimensional vector (n is the number of classes) to predict the distribution of the different classes. Backpropagation is utilized to decrease the classification error.
In the convolution layer, the input image is convolved with a spatial filter to form the feature map, expressed as:

X_j^l = f(Σ_i X_i^{l−1} * W_ij^l + b_j^l).

This formula describes the jth feature map in layer l, where X_j^l is calculated from the previous feature maps X_i^{l−1} convolved with the kernels W_ij^l and adding a bias parameter b_j^l. The mapping is completed using the ReLU function f(a) = max(0, a). The pooling layer is sandwiched between the continuous convolution layers to compress the amount of data and parameters and reduce overfitting. The max-pooling method was chosen in this work as follows:

X_{j,k}^l = max_{0≤m,n<s} X_{j·s+m, k·s+n}^{l−1},
where j and k are the locations in the current feature map X_j^l and s stands for the pooling size. The double fully connected layer structure can effectively integrate the multi-scale features of the image. Considering the multiple influencing factors of time, frequency, and channel, this study used double fully connected layers to improve the performance gain of the softmax layer. A two-way softmax in the last layer of the deep network was used to predict the distribution of the two motor imagery tasks:

y_i = exp(x_i) / Σ_j exp(x_j),

where x_i is the ith input to the softmax layer and y_i represents the output probability distribution. The gradient of the backpropagation was calculated according to the cross-entropy loss function:

E = −Σ_i t_i log(y_i),

where t_i is the target label. Furthermore, we used the stochastic gradient descent (SGD) optimizer with a learning rate of 1 × 10−4 to improve the speed of the network training:

W_k ← W_k − μ ∂E/∂W_k, b_k ← b_k − μ ∂E/∂b_k,

where μ is the learning rate, W_k represents the weight matrix for kernel k, and b_k represents the bias value. E represents the difference between the desired output and the real output. In our study, an eight-layer neural network structure was used to classify the two-class MI signals (Figure 11).
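The building blocks above (max pooling, softmax, cross-entropy) can be sketched in numpy; the toy inputs are illustrative, not taken from the paper's network.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax: y_i = exp(x_i) / sum_j exp(x_j)."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y, t):
    """E = -sum_i t_i log(y_i), averaged over the batch."""
    return float(-np.mean(np.sum(t * np.log(y + 1e-12), axis=-1)))

def max_pool(x, s):
    """Non-overlapping s x s max pooling, matching
    X[j,k] = max_{0<=m,n<s} X[j*s+m, k*s+n]."""
    h, w = x.shape
    return (x[: h - h % s, : w - w % s]
            .reshape(h // s, s, w // s, s)
            .max(axis=(1, 3)))

logits = np.array([[2.0, 0.5], [0.1, 1.9]])        # two-way softmax inputs
probs = softmax(logits)
loss = cross_entropy(probs, np.array([[1.0, 0.0], [0.0, 1.0]]))
pooled = max_pool(np.arange(16, dtype=float).reshape(4, 4), 2)
```

Each softmax row sums to one, giving the predicted distribution over the two MI classes, and 2 × 2 pooling keeps only the strongest response in each block.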
Considering the multiple influencing factors of time, frequency, and channel, we used two fully connected layers to improve the performance gain of the softmax layer [58]. The gradient of the backpropagation was calculated using the cross-entropy loss function, and we used a stochastic gradient descent with momentum (SGDM) optimizer with a learning rate of 1 × 10−4 to improve the speed of network training. To reduce computation time and prevent overfitting, we adopted the dropout operation. The parameters of the proposed CNN model are summarized in Table 4.
The average classification accuracy and kappa value were used as evaluation criteria to compare the performances of all methods. We divided the RD into training data and test data using 10-fold cross-validation [61]. In each dataset, 90% of the trials combined with the GD were selected randomly as the training set, and the remaining 10% of the RD was used as the test set. This operation was repeated 10 times.
The kappa value is a well-known criterion for evaluating EEG classification because it removes the influence of random errors. It is calculated as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (i.e., the classification accuracy) and p_e is the agreement expected by chance. We determined the optimal ratio of the RD and GD by comparing the classification accuracies obtained with different RD:GD ratios.
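This kappa computation can be sketched as follows (assuming integer class labels; `cohen_kappa` is our own name, equivalent in spirit to `sklearn.metrics.cohen_kappa_score`):

```python
import numpy as np

def cohen_kappa(y_true, y_pred, n_classes=2):
    # Build the confusion matrix: rows = true class, columns = predicted class.
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    p_o = np.trace(cm) / n                                   # observed agreement
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement
    return float((p_o - p_e) / (1 - p_e))
```

Perfect agreement yields κ = 1, while chance-level predictions yield κ = 0.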

Results of the Fréchet Inception Distances for Different Data Augmentation Methods
In this experiment, we used five DA methods to generate artificial MI-EEG data. We executed data augmentation on the spectrogram MI signals (Section 2.2) for each subject independently. There were 200 trials per subject in dataset 1 and 720 trials per subject in dataset 2b. For the GT and NA methods, all trials from one subject were randomly sampled for training, while a 10-fold cross-validation strategy was used to train the generative models for the AE, VAE, and DCGAN. The quality of the GD was assessed using the FID, which measures the distance between the two probability distributions; a lower value represents better DA performance. As shown in Table 5, the data generated by the GT were considerably different from the RD. The quality of the data generated by the DCGAN was significantly higher than that of the other models, although the FID results were not ideal. Among the three DA methods based on generative models, the scores for dataset 2b were better than those for dataset 1. Some possible explanations follow:

1. Each subject had 200 trials in dataset 1 and 720 trials in dataset 2b; larger-scale training data improved the robustness and generalization of the model.

2. Due to the difference in sampling rates, the sample sizes per trial were 400 and 1000 for datasets 1 and 2b, respectively; more samples help to improve the resolution of the spectrogram.

3. The experimental protocol of dataset 2b included a cue-based screening paradigm designed to enhance the subjects' attention before imagery, whereas dataset 1 had no similar setting. This may have led to a more consistent feature distribution and higher-quality MI spectrogram data.

In summary, the sampling rate, the design of the paradigm, and the scale of the dataset can clearly influence the quality of the generated data. Figure 12a,b shows the analysis of variance (ANOVA) statistics of the different methods for the BCI Competition IV datasets 1 and 2b, respectively; there were statistically significant differences between the DA methods. To compare the effects of the different DA methods, we show the generated spectrogram MI data in Figure 13.
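The FID used here compares two Gaussians fitted to feature vectors of the real and generated data. A sketch of the computation (assuming SciPy is available; in the standard FID the features come from an Inception network, which we omit here):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    # Fit a Gaussian (mean, covariance) to each set of feature vectors.
    mu1, mu2 = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    c1 = np.cov(real_feats, rowvar=False)
    c2 = np.cov(gen_feats, rowvar=False)
    # FID = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2*(C1 C2)^(1/2))
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean))
```

Identical distributions give an FID near zero; the score grows as the generated distribution drifts from the real one.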

Classification Performance of Different Data Augmentation Methods
We used the average classification accuracy and the mean kappa value to evaluate both datasets. First, we determined the classification accuracies using DA. The classification accuracies and standard deviations are shown in Tables 6 and 7, and the kappa values and standard deviations of the methods are presented in Tables 8 and 9. The average classification accuracies of the CNN without DA were 74.5 ± 4.0% and 80.6 ± 3.2% for datasets 1 and 2b, respectively (baseline). The NA-CNN, VAE-CNN, and DCGAN-CNN provided higher accuracies than the baseline for both datasets (Tables 6 and 7). The results for the different ratios of RD and GD indicated no positive correlation between the accuracy and the proportion of training data from the GD. In this study, a ratio of 1:3 (RD:GD) provided the optimal DA performance. The average classification accuracy of the CNN-DCGAN was 12.6% higher than the baseline for dataset 2b and 8.7% higher than the baseline for dataset 1. We also noticed that none of the ratios provided satisfactory results for the CNN-GT model. One possible explanation is that the rotation may have corrupted the channel information of the EEG, resulting in incorrect labels.

Table 7. Classification accuracy of the methods for the BCI competition IV dataset 2b (baseline: 80.6 ± 3.2%).
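The RD/GD mixing described above (90% of the real trials plus generated trials at a chosen RD:GD ratio, tested on the held-out 10% of real trials) can be sketched at the index level as follows (the helper names are ours, not from the original code):

```python
import numpy as np

def make_folds(n_real, n_splits=10, seed=0):
    # Shuffle the real-trial indices and split them into equal folds.
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_real), n_splits)

def build_training_set(real_idx, test_fold, gen_pool, ratio=3, seed=0):
    # Training set: 90% of real trials plus `ratio` times as many generated
    # trials (RD:GD = 1:ratio); the held-out fold is kept purely real.
    rng = np.random.default_rng(seed)
    train_real = np.setdiff1d(real_idx, test_fold)
    gen_sel = rng.choice(len(gen_pool), size=ratio * len(train_real), replace=True)
    return train_real, gen_sel
```

Keeping the test fold free of generated trials ensures the reported accuracy reflects performance on real data only.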

Table 6. Classification accuracy of the methods for the BCI competition IV dataset 1 (baseline: 74.5 ± 4.0%).

The mean kappa value of the CNN-DCGAN was the highest among the methods, indicating that the DCGAN acquired sufficient knowledge of the features of the EEG spectrogram. As shown in Tables 8 and 9, the performance of the three generative models was superior to that of the other DA methods. In addition, the standard deviation of the kappa value was relatively small, indicating the good stability and robustness of this method. Regardless of the RD:GD ratio, the results of the CNN-DCGAN showed a high degree of consistency in the average classification accuracy. Overall, the results demonstrated that this strategy provided the most stable and accurate classification performance. ANOVA and paired t-tests were performed: we compared the CNN-DCGAN with the other CNN-DA methods to determine the optimal DA method (with the optimal ratio), and we compared the CNN-DCGAN with the plain CNN to verify the effectiveness of augmentation. Statistically significant differences were observed (Figure 14). DA using the DCGAN effectively improved the performance of the classification model (CNN), and among the proposed CNN-DA methods, the CNN-DCGAN achieved the best classification performance. In addition, the p-values for the comparisons between the CNN-DCGAN and the other proposed methods are shown in Table 10. The classification performance of the CNN-DCGAN was significantly higher than that of the other methods (p < 0.01). Although the CNN-VAE was second to the CNN-DCGAN for dataset 2b (p < 0.05), the CNN-DCGAN obtained the best p-values. In summary, the DCGAN provided effective DA and resulted in the highest classification performance.
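A paired comparison of this kind can be reproduced with a standard paired t-test on per-subject (or per-fold) accuracies; the numbers below are hypothetical, not the paper's results:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-subject accuracies: baseline CNN vs. CNN-DCGAN.
baseline = np.array([0.70, 0.72, 0.68, 0.74, 0.71, 0.69, 0.73])
dcgan = baseline + np.array([0.05, 0.06, 0.04, 0.05, 0.07, 0.05, 0.06])

# Paired t-test: the same subjects are measured under both conditions,
# so the test is applied to the per-subject differences.
t_stat, p_value = ttest_rel(dcgan, baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")
```

A paired test is appropriate here because each subject contributes one accuracy under every method, making the samples dependent.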

Comparison with Existing Classification Methods
We compared the classification performance of the CNN-DCGAN hybrid model with that of existing methods (Figure 15); the results are shown in Table 11. The CNN-DCGAN exhibited a 0.072 improvement in the mean kappa value over the winning algorithm of the BCI competition IV dataset 2b [62]. This strategy proved favorable for DNN-based classification of MI-EEG signals, and the proposed model achieved comparable or better results than the other methods.


Discussion
In this study, we proposed a method to augment and generate EEG data to address the problem of small-scale datasets in deep learning applications for MI tasks. The BCI Competition IV datasets 1 and 2b were used to evaluate the method. We used a new form of input to the CNN that considered the time-frequency and energy characteristics of the MI signals to perform the classification. Different DA methods were used for MI classification. The results showed that the classification accuracy and mean kappa values of the DA based on the DCGAN were the highest for the two datasets, indicating that the CNN-DCGAN was the preferred method for classifying MI signals and that the DCGAN was an effective DA strategy.
Recently, a growing number of researchers have used deep learning networks to decode EEG signals [60]. However, it remains a challenge to find the optimal representation of an EEG signal that is suitable for a classification model across different BCI tasks. For example, the number of channels and the selection of frequency bands are crucial when choosing input data; therefore, different input parameters need to be matched to neural networks with different structures. Researchers require sufficient knowledge of the implications of using different EEG parameters and of choosing classification networks for different forms of input data. In Vernon et al. [68], a deep separable CNN achieved better classification results for time-domain EEG signals because the model structure was highly suitable for the time-domain characteristics of steady-state visually evoked potentials. AlexNet showed excellent classification performance for time-frequency EEG signals after a continuous wavelet transform in Chaudary et al. [69]. In this study, we concluded that the time-frequency representation of MI signals was more suitable as the input of the DNN classification model. In future studies, we will investigate which useful features the convolution kernels learn from the EEG and optimize the structure and parameters of the model accordingly.
In applications of EEG decoding, the performance of a classification model based on DNNs is directly related to the scale of the training data. However, in a BCI system, it is difficult to collect large-scale data due to the strict requirements on the subjects and the experimental environment. Data augmentation provides an enlightening strategy for overcoming this limitation, and we have verified its effectiveness in this manuscript. Previous research has shown that generative networks provide good performance for the deep interpretation of EEG signals [70]. Therefore, future studies could use generative networks to interpret the physiological meaning of EEG signals in depth, improve the explanation of EEG signals, and investigate how to design DA models tailored to the requirements of specific tasks. Finally, by combining these methods, we hope to achieve accurate identification of MI tasks with a small sample size.
As an important technology focused on rehabilitation [71,72], MI-BCI aims to replace or recover motor nervous system functionality lost due to disease or injury. Future work could extend the application of DA for MI-EEG to clinical BCI tasks. For example, due to cerebral injury, it is difficult to collect usable EEG signals from stroke patients, which may lead to a long calibration cycle; one promising approach is to generate artificial data from limited real data using a DA strategy and train the decoding model on these data. Additionally, the proposed methods could be used to assess the differences between patients and healthy people, with the generator producing "healthy" EEG data based on patients' signals and the discriminator distinguishing whether a given EEG signal is healthy. Based on DA for EEG, we may also be able to establish a correlation between the EEG signal and the rehabilitation condition. Rafael and Esther [73] used DA methods to simulate EMG signals with different tremor patterns for patients suffering from Parkinson's disease and extended them to different sets of movement protocols. The proposed method therefore has the potential to be extended to rehabilitation and clinical BCI operations in practical applications.

Conclusions
In this study, we proposed a DA method based on a generative adversarial model to improve the classification performance in MI tasks. We utilized two datasets from the BCI competition IV to verify our method and evaluated the classification performance using statistical methods. The results showed that the DCGAN generated high-quality artificial EEG spectrogram data and was the optimal approach among the DA methods compared in this study. The hybrid CNN-DCGAN structure outperformed other methods reported in the literature in terms of classification accuracy. Based on the experimental results, we can conclude that the proposed model was not limited by small-scale datasets and that DA provides an effective strategy for EEG decoding based on deep learning. In the future, we will explore specific DA strategies for different mental tasks and signal types in a BCI system.