Recognition of EEG Signals from Imagined Vowels Using Deep Learning Methods

The use of imagined speech with electroencephalographic (EEG) signals is a promising field of brain-computer interfaces (BCI) that seeks communication between areas of the cerebral cortex related to language and devices or machines. However, the complexity of this brain process makes the analysis and classification of this type of signal a relevant topic of research. The goals of this study were: to develop a new algorithm based on Deep Learning (DL), referred to as CNNeeg1-1, to recognize EEG signals in imagined vowel tasks; to create an imagined speech database with 50 subjects specialized in imagined vowels from the Spanish language (/a/,/e/,/i/,/o/,/u/); and to contrast the performance of the CNNeeg1-1 algorithm with the DL Shallow CNN and EEGNet benchmark algorithms using an open access database (BD1) and the newly developed database (BD2). In this study, a mixed analysis of variance was conducted to assess the intra-subject and inter-subject training of the proposed algorithms. The results show that, for the intra-subject training analysis, the best performance among the Shallow CNN, EEGNet, and CNNeeg1-1 methods in classifying imagined vowels (/a/,/e/,/i/,/o/,/u/) was exhibited by CNNeeg1-1, with an accuracy of 65.62% for the BD1 database and 85.66% for the BD2 database.


Introduction
Brain-computer interfaces (BCI), also referred to as human-machine interfaces, are systems that use brain signals to control computers or hardware devices [1][2][3]. These systems can use invasive or noninvasive recording methods, where the latter stand out because they do not require surgical interventions [4]. Research on BCI is aimed at developing technological solutions in fields like motor and cognitive rehabilitation [5]; assistance in the recovery of compromised communication and/or physical skills [4]; control of video games [6]; augmentative assistance platforms [7][8][9], among others, aimed at improving the user's quality of life and well-being.
Imagined speech (IS) is an innovative technique for BCI applications using voluntary signals [10][11][12]. Imagined speech is the internal pronunciation of phonemes, words, or sentences, without the movement of the phonatory apparatus or any audible output [13]. Previous imagined speech works with conventional machine learning (ML) methods for imagined vowel recognition (/a/,/e/,/i/,/o/,/u/) have chosen to use time, frequency, or time-frequency transformations as the feature vector. Among the features that have been used for imagined vowel recognition (/a/,/e/,/i/,/o/,/u/) with ML are: statistical descriptors (average power, mean, variance, and standard deviation) [14]; common spatial patterns (CSP) filtering and adaptive collection (AC) [15]; Discrete Wavelet Transform (DWT) [16,17]; eigenvalues of the covariance matrix [18]; and mixed

Related Work
There are different methodologies for the non-invasive capturing of brain signals such as magnetoencephalography (MEG) [21,22], functional magnetic resonance imaging [23,24] and electroencephalography (EEG) [25,26]. The advantages that electroencephalography has over the other methods are its low cost, portability, and high time resolution [27]. The stages of a BCI processing system with EEG are: signal acquisition, preprocessing, feature extraction, classification, and device control [7].
Among the types of noninvasive EEG signals used for BCI control are evoked potentials and voluntary signals [28]. The former require external stimuli and include signals such as Event Related Potential (ERP), Evoked Potential (P300), Movement Related Cortical Potential (MRCP), and Steady State Evoked Potentials (SSEP) [28]. Voluntary signals, on the other hand, are produced autonomously by the user and include sensorimotor rhythms (SMR), slow cortical potentials (SCP), motor imagery (MI), and non-motor cognitive signals [28]. The studies conducted with MI sought to mimic motor intention (without using the muscular system), mainly using event-related desynchronization or event-related synchronization (ERD/ERS) signals [3,8]. However, MI requires a high degree of training to mitigate the effects of user attention and the consequent mental fatigue [4,9]. Within the context of voluntary signals, BCI systems are being developed based on high-level cognitive processes such as mental mathematical operations, visual counting, musical imagination, and imagined speech, among others [1,28]. One of the advantages of these new methods is the number of tasks that can be classified. However, these new methods are limited by the current knowledge in fields such as neuroscience, cognitive science, and artificial intelligence. Some classifiers that have been used for imagined vowel recognition (/a/,/e/,/i/,/o/,/u/) are summarized in Table 1.
In addition, it is noteworthy that the complexity in the processing of EEG signals is mainly due to their voltage range (µV), their low signal-to-noise ratio (SNR), their non-linearity, non-temporality, and the low spatial resolution given by the EEG electrodes. Given these characteristics, conventional ML methods are limited for the recognition of this type of signal [29,30].
This poses an important challenge in the design of new algorithms to identify the characteristics of the EEG signal [31,32] and to select or design the proper classifiers [33,34]. In conclusion, an ideal method should be able to automatically recognize the inherent characteristics of the EEG signal with its nonlinear and nonstationary properties.
Table 1. Machine learning classifiers used for imagined vowel recognition (/a/,/e/,/i/,/o/,/u/).
Some DL architectures that have been used for imagined vowel recognition with EEG are summarized in Table 2. Additionally, to reduce the effect of the low signal-to-noise ratio of EEG signals, there are alternative DL methods using EEG signal preprocessing for imagined vowels, such as: filtering from 2 Hz to 40 Hz, artifact detection and removal with Independent Component Analysis (ICA), and analysis with Hessian approximation preconditioning; eigenvalues of the covariance matrix [18]; 50 Hz LPF-IIR low-pass filters, 0.5 Hz HPF-IIR high-pass filters, and feature vectors consisting of EEG coherence, partial directed coherence (PDC), directed transfer function (DTF), and transfer entropy [40].
Although DL architectures have been successfully applied to image recognition [43][44][45] and speech signal recognition [46][47][48], their use for EEG signal recognition tasks, such as imagined speech [49,50], remains a challenge and requires the development of novel preprocessing techniques and of new DL structures and architectures [51,52]. Among the difficulties posed by DL algorithms are the following: CNN methods are susceptible to the effect of artifacts present in EEG signals, which reduces the accuracy of the classifiers [27], and they are also affected by the reduced amount of data used in the training process [27]; the use of brain rhythms with fixed ranges as inputs to different CNNs can cause a decrease in classification accuracy, since some of these rhythms may not provide the information needed for the system to extract the features from the targeted EEG signal [27]; DNN has accuracy limitations due to the number of subjects in the sample, inter-subject analysis, and the amount of time an experiment may take [28]; and DL with Shallow CNN, Deep CNN, and EEGNet is susceptible to the size of the experimental dataset and the reduced number of tests per class [17]. Additionally, DL architectures are susceptible to overfitting, in which the neural networks are overtrained and classification accuracy decreases during testing [35]. The Shallow CNN and EEGNet architectures are used as benchmarks for the proposed architecture, so they are described in more detail in Appendices A and B.

Data Description
This research used the reference database developed by Coretto et al. that involved 15 subjects and the imagined vowel tasks (/a/,/e/,/i/,/o/,/u/) [16]. It also includes a new database with 50 individuals, recorded under controlled conditions, for imagined vowels (/a/,/e/,/i/,/o/,/u/) developed specifically for this research.
The experimental protocol for this database consisted of asking each subject to sit on a chair one meter away from an LCD screen. Once seated, they were shown a message on the screen for two seconds warning them to get ready. Then, they were shown the vowel they had to imagine for two seconds. Next, they imagined the vowel continuously for four seconds. Finally, they were shown a message on the screen instructing them to rest for four seconds. This procedure was repeated 40 times for each imagined vowel [16].
In this database, the signals were recorded with an 18-electrode Grass device at a sampling frequency of 1024 Hz. The EEG electrodes were located according to the international 10-20 system and the database contains information from six electrodes: F3, F4, C3, C4, P3, and P4 [16].

New Database (BD2)
This new database, created by us specifically for this study, held the information of 50 university students (20 women and 30 men) whose native language is Spanish (M = 24.76, SD = 7.66) (https://github.com/carlos-sarmientov/DATABASE-IMAGINED-VOWELS-1 accessed on: 4 August 2021). The participants did not exhibit any medical or neurological conditions. The experiment was approved by the Ethics Committee of the School of Medicine at Universidad Nacional de Colombia and the subjects gave written consent for their participation.
The experiment was conducted in the Cognition and Intelligent Systems laboratory at Universidad Pedagógica Nacional (Bogotá-Colombia), under controlled conditions: 80 lm/m² lighting and minimum environmental noise (ASTM STC 63). First, each subject was asked to sit on a comfortable chair and an EEG neuroheadset was placed on their head. The neuroheadset has 14 electrodes located on the left hemisphere, covering the language area. Two reference electrodes were located on the forehead. The electrodes were placed according to Hickok and Poeppel's neurological model of language related to the sensorimotor interface and articulatory network (Broca's area and motor cortex), related to Brodmann areas 4, 6, 43, 44, and 45 [20]. The electrodes were placed on the neuroheadset in a matrix-like structure where the rows and columns of electrodes were 18 mm apart. To reference the neuroheadset on the head of each subject, the T3 and C3 positions were used according to the 10-20 system (Figure 1).
Once the headset was secured, a light source, placed one meter from the subject, was lit to indicate the moment when they should start or finish the task of thinking about a specific vowel with imagined speech. To decrease blinking and eye movement artifacts, subjects were asked to keep their eyes closed. For the experiment, each subject was told to imagine a given vowel continuously and without pronouncing it while the light source was on. They were also told that, when the light source was turned off, they had to stop imagining the vowel and relax their body. During the experiment, the light source remained on for four seconds and then was turned off for three seconds. The procedure was repeated 25 times for each one of the imagined vowels. Upon completion of the 25 imagined speech tasks for each vowel, subjects rested for 5 min before continuing with the next vowel. The imagined tasks were arranged in the following order: /a/,/e/,/i/,/o/,/u/ (Figure 2).
The EpocSimulinkImporter acquisition software from Xcessity (Linz, Austria) was used to export the data to Matlab's Simulink. Signal preprocessing was performed with Matlab R2020a. Additionally, signal processing was performed with: Matlab R2020a using the Deep Learning Toolbox for the CNNeeg1-1 model, Jupyter Notebook (Anaconda3) with Python 3.0 using TensorFlow and Keras for the Shallow CNN and EEGNet models. Data analysis was carried out using the Statistical Package for the Social Sciences (SPSS) Version 25 software (Armonk, NY, USA).

Deep Learning Methods with Convolutional Neural Networks (CNN)
The architectures used in this research are described below. The first one corresponds to the new proposed method. Two benchmark CNN methods previously reported for imagined speech are also included [17].

CNNeeg1-1 Architecture
The proposed architecture consists of 10 signal preprocessing blocks, one for each of the 10 CNNs used for the recognition of imagined vowel pairs, and one stage for the one-against-one function (1-1) that allows multi-class classification of imagined vowels (/a/,/e/,/i/,/o/ and /u/). The proposed DL-based architecture is described below.

Preprocessing
The proposed architecture consists of 10 preprocessing blocks that filter and adapt the brain signals to deliver them to each CNN. Each preprocessing block is mainly composed of a filtering stage using Adaptive-Projection Intrinsically Transformed MEMD (APIT-MEMD) and a signal transformation stage using spectral analysis. The recorded brain signals were edited to keep only the intervals in which the subjects performed the corresponding imagined speech tasks. The signals were divided into trials of 64 samples with an overlap of 85%.
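The segmentation into overlapping trials can be illustrated with a minimal sketch. It is not the authors' Matlab code: the function name `segment_trials`, the rounding of the hop size (64 × 0.15 = 9.6, rounded here to 10 samples), and the 128 Hz sampling rate of the example are assumptions made only for illustration.

```python
def segment_trials(signal, window=64, overlap=0.85):
    """Split a 1-D sequence of samples into overlapping windows (trials)."""
    # Hop between consecutive window starts. 64 * (1 - 0.85) = 9.6 samples,
    # rounded to a whole number of samples (assumption, the source does not say how).
    step = max(1, round(window * (1.0 - overlap)))
    return [signal[start:start + window]
            for start in range(0, len(signal) - window + 1, step)]


# Example: 4 s of imagined speech at a hypothetical 128 Hz sampling rate.
samples = list(range(4 * 128))
trials = segment_trials(samples)
print(len(trials), len(trials[0]))
```

Only complete windows are kept in this sketch, so any samples left over at the end of the interval are discarded.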
For the filtering process, the APIT-MEMD method was chosen since the signals have nonlinear and nonstationary characteristics [53]. This method separates the multivariate signals into so-called Intrinsic Mode Functions (IMFs). It includes the following steps [53]:

1. For each multidimensional input frame $[x(t)]_{t=1}^{T}$ and each shift operation $x(t)$, decompose the covariance matrix as $C = E\{ss^{T}\} = W \Lambda W^{T}$, where $W = [w_1, w_2, \ldots, w_n]$ is the eigenvector matrix and $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ is the eigenvalue matrix. In this case, the largest eigenvalue corresponds to the eigenvector $w_1$.
2. Take the first principal component and build a vector pointing in the opposite direction, $w_{01} = -w_1$.
3. Using the Hammersley sequence on a uniformly sampled sphere, build a set of $K$ direction vectors $\{p^{\theta_k}\}_{k=1}^{K}$.
4. Calculate the Euclidean distances from each of the uniform direction vectors to $w_1$.
5. Relocate the half of the uniform projection vectors closest to $w_1$, $p^{\theta_k}_{w_1}$, using $\hat{p}^{\theta_k}_{w_1} = \dfrac{p^{\theta_k}_{w_1} + \alpha w_1}{\lVert p^{\theta_k}_{w_1} + \alpha w_1 \rVert}$, where $\alpha$ is used to control the density of the relocated vectors.
6. Relocate the other half of the uniform projection vectors, $p^{\theta_k}_{w_{01}}$, the closest to $w_{01}$, using $\hat{p}^{\theta_k}_{w_{01}} = \dfrac{p^{\theta_k}_{w_{01}} + \alpha w_{01}}{\lVert p^{\theta_k}_{w_{01}} + \alpha w_{01} \rVert}$, where $\alpha$ is used to control the density of the relocated vectors.
7. Project the multidimensional signal $[x(t)]_{t=1}^{T}$ along the direction vectors found in steps 5 and 6.
8. Find the time instants $t_i^{\theta_j}$ corresponding to the maxima of the projected data sets, where $\theta_j$ is the angle on the $(n-1)$-dimensional sphere and $j$ is the index of the direction vectors.
9. Interpolate $[t_i^{\theta_j}, x(t_i^{\theta_j})]$ to obtain the multivariate envelope curves $e^{\theta_j}(t)$.
10. Estimate the mean of the envelope curves for the set of $J$ direction vectors: $m(t) = \dfrac{1}{J} \sum_{j=1}^{J} e^{\theta_j}(t)$.
11. Extract the residue $d(t) = x(t) - m(t)$.
12. Repeat these steps until the residue meets the conditions of an IMF for multivariate signals.
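The adaptive relocation of direction vectors (steps 2 and 4 to 6) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of [53]: the principal direction `w1` is taken as given rather than computed from the covariance matrix, the uniform directions are drawn from normalized Gaussian samples instead of a Hammersley sequence, and the names `unit` and `relocate_directions` are hypothetical.

```python
import math
import random


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def relocate_directions(directions, w1, alpha):
    """Pull half of the unit directions toward w1 and the other half toward -w1."""
    w01 = [-c for c in w1]  # step 2: vector opposite to the principal direction
    # Step 4: rank directions by Euclidean distance to w1 (closest first).
    ranked = sorted(directions, key=lambda p: math.dist(p, w1))
    half = len(ranked) // 2
    relocated = []
    # Steps 5 and 6: shift each vector by alpha * target, then renormalize.
    for group, target in ((ranked[:half], w1), (ranked[half:], w01)):
        for p in group:
            relocated.append(unit([pc + alpha * tc for pc, tc in zip(p, target)]))
    return relocated


# Example with 64 pseudo-uniform directions in 3-D (Gaussian sampling is an
# assumption that replaces the Hammersley sequence named in step 3).
rng = random.Random(0)
dirs = [unit([rng.gauss(0, 1) for _ in range(3)]) for _ in range(64)]
w1 = unit([1.0, 0.0, 0.0])
adapted = relocate_directions(dirs, w1, alpha=0.5)
```

Larger values of `alpha` concentrate more of the projection directions around the axis of highest signal power, which is the role the text assigns to α.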
The first two IMFs resulting from applying the APIT-MEMD algorithm to the brain signals (IMF1, IMF2) are chosen for this architecture. They have center frequencies of approximately 30 Hz and 15 Hz, respectively (Figure 3). These two IMFs are added for each one of the 14 electrodes.
With the signals obtained from APIT-MEMD, a transformation between electrode pairs is performed according to the following equation: $\left|\mathrm{FFT}(E_i) - \mathrm{FFT}(E_j)\right|$, where $E_i$ and $E_j$ represent each electrode, with $i, j = 1, \ldots, 14$ and $j > i$. The values are normalized between 0 and 1. After this, each trial of databases BD1 and BD2 is converted into a JPEG image. The images are 15 × 32 for BD1 and 91 × 32 for BD2. The rows of these images correspond to the frequencies and the columns correspond to the pairwise differences between electrodes. Database BD1 results in 1888 images for each imagined vowel, for a total of 9440 images for the training and testing of the CNNs. Database BD2 produces 1274 images for each imagined vowel, for a total of 6370 images for the training and testing of the CNNs.
An NVIDIA GeForce GTX 1080 Ti GPU with 11 Gbps GDDR5X memory and an 11 GB frame buffer was used. The algorithm was implemented with Matlab 2020a using the Deep Learning Toolbox. For the training of the CNNs, the stochastic gradient descent with momentum (SGDM) optimizer was used. The learning rate chosen was 0.01 and the number of epochs was 50. The values of the hyperparameters (learning rate, training epochs, and activation function) were selected according to [17]. 70% of the data was used for training and 30% for validation.
The architecture of the CNNs is described below. The input layer for each CNN receives the information of the images obtained from the EEG imagined vowels. It consists of a tensor of size 32 × 15 × 1 for database BD1 and 32 × 91 × 1 for database BD2 (Table 3). Next comes the dropout layer that randomly sets, for each input image, a mask with 25% of its elements to zero, with the goal of minimizing the overfitting in the training process (Table 3). Layer 3 is a 2D convolutional layer that applies a sliding convolution filter on the input.
For this layer, 50 filters are configured with a size of 5 × 5, a stride of 1 × 1, and a padding of 0; thus, for database BD2, the output has a size of 28 × 87 × 50 (Table 3). In layer 4, a batch normalization is applied to improve the training of the convolutional networks and reduce the sensitivity to network initialization. It is applied to the 50 input channels of the layer. In layer 5, the reluLayer function is applied, where $f(x) = \max(0, x)$. Layer 6 is a max pooling layer, where a downsampling divides the input into rectangular regions and the maximum value of each region is calculated (Table 3). The size of each region was 2 × 2, with a stride of 2 × 2 and a padding of 0; thus, the output has a size of 14 × 43 × 50 (Table 3).
Next, a 2D convolutional layer is implemented in layer 7. For this layer, 50 filters with size of 11 × 11, a stride of 1 × 1, and a padding of 0 are configured; thus, the output has a size of 4 × 33 × 60. In layer 8 a batch normalization is applied to the 50 input channels of the layer, and then, in layer 9, the reluLayer function is applied. Layer 10 corresponds to a max pooling layer where the size of each region was selected as 2 × 2, with a stride of 2 × 2, and a padding of 0; thus, the output has a size of 2 × 16 × 60. In layer 11, a batch normalization is applied to improve the training of the convolutional networks and reduce the sensitivity to network initialization, in this case applied to the 60 channels of the previous layer (Table 3).
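The spatial sizes quoted for the convolution and pooling layers follow from the standard output-size relation for unpadded sliding windows, floor((n + 2p − k)/s) + 1. The short sketch below traces that arithmetic for the 32 × 91 input of database BD2; the helper names `out_size` and `trace_shapes` are hypothetical and only the spatial dimensions are tracked, not the number of channels.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length of a sliding window along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width):
    """Spatial size after each convolution / max pooling stage of one CNN."""
    shapes = [("input", height, width)]
    stages = (("conv 5x5", 5, 1), ("maxpool 2x2", 2, 2),
              ("conv 11x11", 11, 1), ("maxpool 2x2", 2, 2))
    for name, kernel, stride in stages:
        height = out_size(height, kernel, stride)
        width = out_size(width, kernel, stride)
        shapes.append((name, height, width))
    return shapes


# Database BD2 input: 32 frequency rows x 91 electrode-pair columns.
for name, h, w in trace_shapes(32, 91):
    print(f"{name}: {h} x {w}")
```

For the BD2 input this yields 28 × 87, 14 × 43, 4 × 33, and 2 × 16 after the four stages.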
In layers 12 and 13, two fully connected layers are implemented, multiplying the inputs by a weight matrix to which the corresponding bias vector is added (Figure 5). Layer 12 has an output size of 60 and layer 13 has an output size of 2, corresponding to the number of classes of each one of the 10 CNNs. Subsequently, in layer 14 (Table 3), the softmax function is applied to obtain the probabilities of the corresponding classes. Finally, layer 15 corresponds to the classification output layer of the corresponding CNN, where the cross-entropy loss is calculated. Then, the classification information of the 10 CNNs is fed to a last block called one-against-one (1-1) [54]. The one-against-one function (1-1) has 10 inputs corresponding to the binary classifier outputs of the 10 CNNs [54], and it is chosen as the output of this last block.
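How ten binary decisions can be combined into a five-class decision may be sketched as below. The combination rule is an assumption: the sketch uses max-wins voting, a common one-against-one strategy, and breaks ties by vowel order; the exact rule applied in [54] is not restated in this section. The names `VOWELS`, `PAIRS`, and `one_against_one` are hypothetical.

```python
from collections import Counter
from itertools import combinations

VOWELS = ["a", "e", "i", "o", "u"]
PAIRS = list(combinations(VOWELS, 2))  # 10 vowel pairs, one binary CNN per pair


def one_against_one(binary_predictions):
    """Combine the outputs of the pairwise classifiers by majority vote.

    binary_predictions maps each vowel pair to the vowel chosen by its classifier.
    """
    votes = Counter(binary_predictions[pair] for pair in PAIRS)
    best = max(votes.values())
    # Assumption: ties are resolved by the order of VOWELS.
    return next(v for v in VOWELS if votes.get(v, 0) == best)


# Example: every classifier that involves /i/ picks /i/; the others pick
# the first vowel of their pair.
predictions = {pair: ("i" if "i" in pair else pair[0]) for pair in PAIRS}
print(one_against_one(predictions))
```

Five classes give 5 × 4 / 2 = 10 unordered pairs, which matches the ten CNNs of the architecture.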
It is important to underline that CNNeeg1-1 is composed of ten CNN-type algorithms designed to extract the characteristics of the magnitude difference of the FFT of the EEG signals obtained through silent speech. Such differences were calculated between pairs of electrodes. Each CNN of CNNeeg1-1 is based on machine vision architectures with DL [43,44] and classic CNN architectures like LeNet-5 and AlexNet [35], since their effectiveness has already been shown. Regarding the architecture itself, the first layer of each CNN of CNNeeg1-1 is a Dropout layer whose goal is to apply to the image a mask with a certain percentage of randomly located zeros. This layer is intended to reduce the possible overfitting resulting from the training of each CNN. The next two blocks contain four layers each, as follows: 2D convolution, batch normalization, non-linearity, and max pooling. The goal of the first block is for each CNN to learn the characteristics of the frequency signals through their convolution with 50 × 5 × 5 spatial filters. The goal of the second block is for each CNN to learn the characteristics of the outputs of the first block. This process is performed through convolution with 50 × 11 × 11 spatial filters. The parameters used in these two blocks were obtained through a grid search that sought to maximize accuracy. With the characteristics found in the training process, the algorithm moves on to the classification stage, made up of two fully connected layers and a softmax layer. The first fully connected layer is made up of 60 neurons and the second one of 2, since it must classify two types of silent speech signals. The results obtained with the proposed CNNeeg1-1 architecture were compared with the Shallow CNN and EEGNet architectures. These are described in Appendices A and B, respectively.
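The electrode-pair spectral features that feed each CNN can be sketched in a few lines. This is an illustration under assumptions rather than the authors' Matlab code: a naive DFT stands in for the FFT, the first 32 coefficients of a 64-sample trial are kept, min-max scaling is used for the normalization between 0 and 1, and the names `dft` and `pairwise_spectral_image` are hypothetical.

```python
import cmath
import math
import random
from itertools import combinations


def dft(samples, n_bins):
    """First n_bins coefficients of the discrete Fourier transform (naive O(N^2))."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
            for k in range(n_bins)]


def pairwise_spectral_image(trial, n_bins=32):
    """Rows are frequency bins, columns are electrode pairs (i, j) with j > i."""
    spectra = [dft(channel, n_bins) for channel in trial]
    pairs = list(combinations(range(len(trial)), 2))
    # |FFT(E_i) - FFT(E_j)| for every pair and every frequency bin.
    image = [[abs(spectra[i][k] - spectra[j][k]) for i, j in pairs]
             for k in range(n_bins)]
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    scale = (hi - lo) or 1.0  # avoid division by zero for a constant image
    return [[(v - lo) / scale for v in row] for row in image]


# Example: one synthetic trial with 14 electrodes and 64 samples per electrode.
rng = random.Random(0)
trial = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(14)]
image = pairwise_spectral_image(trial)
print(len(image), len(image[0]))
```

With 14 electrodes there are 14 × 13 / 2 = 91 pairs, and with 6 electrodes there are 15, which is consistent with the 91-column and 15-column images reported for BD2 and BD1.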

Analysis of Intra-Subject Training Results for the Shallow CNN, EEGNet, and CNNeeg1-1 Algorithms Using Databases BD1 and BD2
The intra-subject training process consists of taking the brain signals from the silent speech tasks of each subject independently, disregarding those from the other subjects. The set of signals from each subject is split randomly into a training set, with 70% of the signals, and a testing set, with 30% of the signals. This process is repeated for each subject in each database independently. Consequently, the information of the two databases is kept apart and is never mixed.
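The intra-subject partition can be illustrated with a short sketch. The helper name `intra_subject_split`, the fixed random seed, and the toy subject identifiers are assumptions introduced only for the example.

```python
import random


def intra_subject_split(trials, train_fraction=0.7, seed=0):
    """Randomly split the trials of ONE subject into training and testing sets."""
    shuffled = list(trials)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Example: two hypothetical subjects with 100 trials each. Every subject is
# split on its own, so no trial of another subject enters the split.
data = {"S01": [("S01", k) for k in range(100)],
        "S02": [("S02", k) for k in range(100)]}
splits = {subject: intra_subject_split(trials) for subject, trials in data.items()}
train, test = splits["S01"]
print(len(train), len(test))
```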
The statistical analysis for the intra-subject training process was done using a mixed analysis of variance with repeated measures. Using the BD1 database, the following results were obtained: Shallow CNN achieved an accuracy of (M = 0.3171, SD = 0.0114), EEGNet achieved an accuracy of (M = 0.3506, SD = 0.0133), and CNNeeg1-1 obtained an accuracy of (M = 0.6562, SD = 0.0123) (Figure 5).
Mauchly's test indicated that the assumption of sphericity was violated (χ²(2) = 46.546, p < 0.05); therefore, the degrees of freedom were adjusted with the Greenhouse-Geisser correction (ε = 0.654). Tests for intra-subject effects show significant differences between the classification of imagined vowels performed by the three CNN models, with F(1.31, 82.46) = 1017.50, p < 0.001, η² = 0.942.
There are also significant differences between the Shallow CNN model with the BD1 database (M = 0.3171, SD = 0.0114) and with the BD2 database (M = 0.5371, SD = 0.0606) for imagined vowel recognition (p < 0.05). Likewise, there are significant differences in accuracy for the EEGNet model between the BD1 database (M = 0.3506, SD = 0.0133) and the BD2 database (M = 0.7068, SD = 0.0396) (p < 0.05). Additionally, there are significant differences in accuracy for the CNNeeg1-1 model between the BD1 database (M = 0.6562, SD = 0.0123) and the BD2 database (M = 0.8566, SD = 0.0446) (p < 0.05). Thus, for the three CNN models, the means for the BD2 database are higher than the corresponding means for the BD1 database (Figure 7).

Subject's Internal Visualization BD2 Database CNNeeg1-1
To visualize the internal representation of the CNNeeg1-1 network, the CAM (Class Activation Mapping) method, which predicts the network behavior using class activation, was used [55]. The following figures show the internal visualization for a subject in imagined vowel tasks (/a/,/e/,/i/,/o/,/u/) using database BD2 in the layer BN_3 (Table 3). Each figure represents, on the horizontal axis, the pairwise differences for the 14 electrodes from E1-E2 to E13-E14 and, on the vertical axis, the corresponding frequencies. The colors represent the CAM value for each electrode pair and each frequency, which ranges from 0 to 255.
Figure 8 shows a subject's internal representation (CAM) for the task of imagining the vowel /a/. Some of the electrode pairs that are activated the most are: E1-E7, from 12 to 56 Hz; E3-E14, from 56 to 60 Hz; E4-E12, from 14 to 18 Hz; E7-E12, from 6 to 14 Hz and from 46 to 48 Hz; E9-E11, from 4 to 6 Hz; and E9-E12, from 4 to 6 Hz, from 32 to 38 Hz, and from 58 to 62 Hz.
Figure 10 shows the subject's internal representation (CAM) for the task of imagining the vowel /i/. In this case, the electrode pairs that are activated the most are: E1-E2, from 6 to 14 Hz; E1-E11 and E1-E12, from 18 to 22 Hz; E5-E7 and E5-E8, from 2 to 8 Hz; and E7-E13, from 26 to 30 Hz.
Figure 11 shows a subject's internal representation (CAM) for the task of imagining the vowel /o/. In this case, the electrode pairs that are activated the most are: E1-E6, from 30 to 32 Hz; E2-E10, from 24 to 30 Hz and from 44 to 50 Hz; E4-E11, from 52 to 54 Hz; E7-E12, from 14 to 20 Hz and from 52 to 62 Hz; and E7-E14, from 34 to 3 Hz.
Figure 12 shows a subject's internal representation (CAM) for the task of imagining the vowel /u/. The electrode pairs that are activated the most are: E2-E10, from 24 to 28 Hz, from 38 to 40 Hz, and from 52 to 54 Hz; E2-E11, from 54 to 58 Hz; E4-E10 and E4-E11, from 14 to 18 Hz; E7-E12, from 8 to 12 Hz and from 52 to 58 Hz; E7-E13, from 8 to 12 Hz and from 32 to 34 Hz; and E7-E14, from 32 to 34 Hz.

Analysis of the Inter-Subject Training Results for the Shallow CNN, EEGNet, and CNNeeg1-1 Algorithms Using BD1 and BD2 Databases
In contrast, the inter-subject training process takes the signals of all subjects in one of the databases used (15 subjects for database BD1 and 50 subjects for database BD2). When one of the CNNs is trained for a given subject, for example subject 1 in BD1, the training set is defined as the data from the other 14 subjects in database BD1, and the testing set is defined as the data from subject 1. For the actual training of the CNN, 70% of the training set is chosen randomly. Once the training is finished, the results are tested with 30% of the testing set, again chosen randomly. This process is then repeated for each one of the remaining subjects in database BD1. The same process is carried out with database BD2 independently; that is, the information in the two databases is not combined.
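The leave-one-subject-out scheme described above can be sketched as follows. The function name `inter_subject_split`, the fixed seed, and the toy data (15 hypothetical subjects with 20 trials each) are assumptions for illustration; only the 70% and 30% sampling fractions come from the text.

```python
import random


def inter_subject_split(data, held_out, train_fraction=0.7, test_fraction=0.3, seed=0):
    """Train on the other subjects, test on the held-out subject."""
    rng = random.Random(seed)
    # Pool the trials of every subject except the held-out one.
    pool = [trial for subject, trials in data.items() if subject != held_out
            for trial in trials]
    train = rng.sample(pool, round(train_fraction * len(pool)))
    # Test on a random 30% of the held-out subject's trials.
    held = data[held_out]
    test = rng.sample(held, round(test_fraction * len(held)))
    return train, test


# Example: 15 hypothetical subjects, as in BD1, with 20 toy trials each.
data = {f"S{n:02d}": [(f"S{n:02d}", k) for k in range(20)] for n in range(1, 16)}
train, test = inter_subject_split(data, held_out="S01")
print(len(train), len(test))
```

Repeating the call with each subject as `held_out` reproduces the rotation over subjects within one database; the two databases would be processed separately.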
Mauchly's test indicated that the assumption of sphericity was not met (χ²(2) = 29.749, p < 0.05); therefore, the degrees of freedom were adjusted with the Greenhouse-Geisser correction (ε = 0.724). Tests for intra-subject effects show significant differences between the classification performed by the three CNN models for imagined speech of the vowels, with F(1.448, 91.231) = 1299.262, p < 0.001, η² = 0.954. Similarly, the results show that there is a significant interaction between the intra-subject variable (CNN model) and the inter-subject variable (database) with respect to accuracy, F(1.448, 91.231) = 73.723, p < 0.001, η² = 0.539.
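The fractional degrees of freedom reported with the F statistics can be checked with a small worked computation. The sketch assumes the usual mixed-design formulas: the effect has k − 1 degrees of freedom and the error term has (k − 1)(N − g), with k = 3 CNN models, N = 65 subjects (15 in BD1 plus 50 in BD2), and g = 2 databases; both are then multiplied by the Greenhouse-Geisser ε. The function name is hypothetical.

```python
def greenhouse_geisser_df(epsilon, levels, subjects, groups):
    """Greenhouse-Geisser adjusted degrees of freedom for a within-subject factor."""
    df_effect = levels - 1
    df_error = (levels - 1) * (subjects - groups)
    return epsilon * df_effect, epsilon * df_error


# Inter-subject training analysis: epsilon = 0.724.
df1, df2 = greenhouse_geisser_df(0.724, levels=3, subjects=65, groups=2)
print(round(df1, 3), round(df2, 3))
```

With the rounded ε = 0.724 this gives 1.448 and 91.224; the small gap to the reported 91.231 is consistent with ε having been rounded to three decimals in the text.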
The post-hoc analysis for the inter-subject training, performed with the Bonferroni correction, is discussed below, highlighting the significant differences between pairs of variables in terms of imagined vowel classification accuracy. Analyzing first the results obtained with the BD1 database, significant differences were found between the Shallow CNN model (Figure 15).
Tables 4 and 5 show the results of the training processes for the three CNN algorithms (Shallow CNN, EEGNet, and CNNeeg1-1) with database BD1 (Table 4) and database BD2 (Table 5). The tables report the accuracy of the intra-subject and inter-subject training models.

Discussion
This research developed a new algorithm based on Deep Learning, referred to as CNNeeg1-1, designed for the recognition of imagined speech patterns (/a/,/e/,/i/,/o/,/u/) based on EEG signals (Figure 4). In addition, a new imagined speech database with 50 Spanish-speaking subjects, named BD2, was created. This database was recorded under artifact-controlled conditions. It is made up of electroencephalographic signals obtained according to Hickok and Poeppel's speech production model [20], involving the dorsal pathway between the sensorimotor interface and the articulatory network over the left hemisphere, in imagined vowel tasks (/a/,/e/,/i/,/o/,/u/). Finally, the performance of the CNNeeg1-1 algorithm was compared with two reference algorithms, Shallow CNN and EEGNet, by analyzing the intra-subject (Figure 7) and inter-subject (Figure 15) training processes using database BD2 (50 subjects) and database BD1 (15 subjects) with a mixed analysis of variance with repeated measures.
[18,40] and an accuracy of 87.96% with 3 subjects [18]; with RNN an accuracy of 70% with 6 subjects [40]; with CNN an accuracy of 32.75% with 15 subjects [41,42] and an accuracy of 35.68% with 15 subjects [42]. In the case of Shallow CNN, Deep CNN, and EEGNet, for 15 subjects, accuracies of 29.62%, 29.06%, and 30.08%, respectively, have been achieved [17]. Thus, the evidence indicates that the CNNeeg1-1 model has a better performance for the recognition of imagined vowels (/a/,/e/,/i/,/o/,/u/) compared to other DL methods.
On the other hand, studies developed with conventional machine learning techniques in imagined vowel classification tasks (/a/,/e/,/i/,/o/,/u/) report the following performance: ELM, ELM-L, ELM-R, SVM-R, and LDA, with accuracies from 50% to 90% with 5 subjects and 64 electrodes [19]; SVM-G, RVM-G, and RVM-L, with accuracies from 77% to 79% with 5 subjects and 19 electrodes [15]; and SVM, Random forest, and rLDA, with accuracies of 22.23%, 23.08%, and 25.82%, respectively, with 15 subjects and 6 electrodes [17]. Thus, the CNNeeg1-1 model (Figures 7 and 15) has an accuracy that is comparable to, and in some cases higher than, that of the previously described ML methods for the recognition of imagined vowels (/a/,/e/,/i/,/o/,/u/).
As a strategy to evaluate the imagined vowel (/a/,/e/,/i/,/o/,/u/) classification ability of the CNNeeg1-1 architecture (Figure 4), it was compared with two previously reported reference architectures for imagined vowel classification: Shallow CNN (Appendix A, Figure A1) and EEGNet (Appendix B, Figure A2). The results of intra-subject training with the BD1 and BD2 databases indicate that there are significant differences between the three CNN models (Shallow CNN, EEGNet, and CNNeeg1-1), with F(1.31, 82.46) = 1017.50, p < 0.001, η² = 0.942. For the two databases in the intra-subject training, the post-hoc analysis found significant differences between the models for each of the corresponding pairs (p < 0.05). This comparison showed that, for the BD1 database, the CNNeeg1-1 model obtained the highest average value (M = 0.6562, SD = 0.0123) (Figure 7). Similarly, for the BD2 database, the CNNeeg1-1 model obtained the highest average value (M = 0.8566, SD = 0.0446) (Figure 7). The CNNeeg1-1 model not only recognizes imagined vowels (/a/,/e/,/i/,/o/,/u/), but also shows a higher accuracy than the Shallow CNN and EEGNet models.
When comparing the BD1 and BD2 databases in the intra-subject training process, for the three CNN models (Shallow CNN, EEGNet, and CNNeeg1-1) there are significant differences between the two databases, with F(1, 63) = 738.12, p < 0.001, η² = 0.921. In all three cases, the mean accuracy of the CNN architecture was higher for the BD2 database than for the BD1 database (p < 0.05) (Figure 7). In the case of the inter-subject training process, we found that, for the EEGNet and CNNeeg1-1 models, there are significant differences between the BD1 and BD2 databases, F(1, 63) = 50.377, p < 0.001, η² = 0.444, with higher means for the BD2 database than for the BD1 database (p < 0.05) (Figure 15). Thus, the performance of the CNNeeg1-1 architecture is supported by the results in the classification of imagined vowels for both the BD1 and BD2 databases. Additionally, the number of subjects in each database (15 subjects for BD1 and 50 subjects for BD2) supports the robustness of the CNNeeg1-1 algorithm.
For this study, we sought to place the electrodes (Figure 1) taking into account the speech production model of Hickok & Poeppel [20]. In this model, the speech production process is related to the dorsal branch of the sensorimotor interface and the articulatory network in the left hemisphere, the motor cortex and Broca's area [20]. Thus, for the recording of BD2 database, the 14 available electrodes were located to cover the corresponding area of the cerebral cortex (Figure 1). In contrast, BD1 database was recorded with three electrodes on the left hemisphere (F3, C3, and P3) around the language area and three electrodes (F4, C4, and P4) on the right hemisphere [16]. There are significant differences between the results obtained with both databases for the three CNN models in the case of intra-subject training, F (1,63) = 738.12, p < 0.001, η² = 0.921, and for the EEGNet and CNNeeg1-1 models in the case of inter-subject training, F (1,63) = 50.377, p < 0.001, η² = 0.444. In these cases, the accuracy values in the classification of imagined vowels are higher for BD2 database than for BD1 database (Figures 7 and 15). This indicates that the placement of the electrodes covering the sensorimotor interface and the articulatory network of the Hickok and Poeppel [20] model contributes to the recognition of imagined vowels.
Comparing BD2 database and BD1 database, we found that BD2 presents a higher accuracy (Figures 7 and 15). One explanation for the higher performance of BD2 is the controlled conditions of the experiment: controlled lighting conditions of 80 lm/m² and controlled environmental noise conditions (ASTM STC 63). During the recording, the 50 participants were asked to remain seated without moving their limbs, which reduced artifacts due to EMG-type signals. Finally, during the acquisition of the signals, the subjects were asked to keep their eyes closed in order to reduce artifacts generated by blinking and eye movement.
Regarding the preprocessing of the EEG signals, several DL methods deliver imagined vowel signals to the DL architectures without any preprocessing, but their results show generally low accuracy values in the classification of imagined vowels [17,41]. Other DL methods perform this preprocessing in different ways, such as: 2 Hz to 40 Hz filtering, artifact detection and removal with ICA, and analysis with Hessian approximation preconditioning [42]; eigenvalues of the covariance matrix [18]; and 50 Hz IIR low-pass and 0.5 Hz IIR high-pass filters together with feature vectors based on EEG coherence, partial directed coherence (PDC), directed transfer function (DTF), and transfer entropy [40], among others. EEG signals have a low signal-to-noise ratio, and they are nonlinear and non-stationary. For this reason, in this study we chose to perform a preprocessing stage using APIT-MEMD and selecting just a few IMFs. This step is followed by the computation of the differences between the FFTs of the EEG signals of electrode pairs for the Shallow CNN, EEGNet, and CNNeeg1-1 models (Figures 4, A1 and A2).
Among the DL architectures that have been used for imagined vowel recognition are: DBN [18,40], RNN [40], CNN [41,42], Shallow CNN [17], and EEGNet [17]. All these architectures have tended to use a single neural network with different layers for multiclass recognition of the imagined vowels. These architectures have common elements such as 2D convolution layers, max pooling layers, nonlinearity function layers, and batch normalization layers. For this study, an architecture called CNNeeg1-1 was designed, which consists of 10 CNNs and a one-against-one fusion (Figure 5). According to the information received from the 10 CNNs, the 1-1 function selects the imagined vowel class with the one-against-one method (Figure 4). In this sense, the performance of the CNNeeg1-1 architecture is corroborated by the results in the classification of imagined vowels for both BD1 and BD2 databases (Figures 7 and 15). Additionally, the number of subjects, 15 for BD1 database and 50 for BD2 database, supports the robustness of the CNNeeg1-1 algorithm.
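The one-against-one fusion can be illustrated with a minimal sketch. It assumes that each of the 10 CNNs decides between one pair of vowels and that the fusion block uses plain majority voting over those decisions; the voting rule, the tie-breaking rule, and all names below are assumptions for illustration, not details confirmed by the text.

```python
from itertools import combinations

VOWELS = ["a", "e", "i", "o", "u"]
PAIRS = list(combinations(VOWELS, 2))  # 10 unordered vowel pairs, one per binary CNN


def one_against_one(pair_winners):
    """Pick the vowel with the most pairwise wins.

    pair_winners maps each (vowel_1, vowel_2) pair to the vowel chosen by the
    binary classifier for that pair. Ties go to the earliest vowel in VOWELS
    (an assumption made only for this sketch).
    """
    votes = {v: 0 for v in VOWELS}
    for pair in PAIRS:
        winner = pair_winners[pair]
        if winner not in pair:
            raise ValueError(f"classifier for {pair} returned {winner!r}")
        votes[winner] += 1
    return max(VOWELS, key=lambda v: votes[v])


# Hypothetical outputs: every classifier involving "i" picks "i";
# the remaining classifiers pick the first vowel of their pair.
example = {pair: ("i" if "i" in pair else pair[0]) for pair in PAIRS}
print(len(PAIRS), one_against_one(example))
```

With five vowel classes there are exactly 10 unordered pairs, which matches the number of CNNs in the architecture.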
The present study has the following limitations. The capture and processing of brain signal data with imagined speech were performed offline. The experiment was carried out in a single session, and it is advisable to perform several sessions in future research. It is also advisable to increase the number of electrodes on the language area of the left hemisphere. In the present study we worked with imagined vowels, but we suggest exploring other language elements such as words. Finally, it is advisable to design other DL architectures to increase the accuracy in data classification.
In general terms, the method based on CNNeeg1-1 for imagined vowels classification does not require demanding training processes, as in the case of imagined motor tasks [56]; it does not require a rigorous attention process like in SSVEP [57,58], P300, or imagined motor tasks [59]; it does not require an external stimulus like SSVEP or P300 [60,61]; and it does not require cognitive tasks that generate muscular or cognitive fatigue as in imagined motor tasks [56,59]. Consequently, the CNNeeg1-1 method developed in this study has the potential to use other language components and to be applied in such relevant fields as BCI device control.

Conclusions
This study developed and tested a new DL-based algorithm called CNNeeg1-1 for EEG imagined vowel signal recognition using two different databases: BD1, with 15 subjects, and BD2, with 50 subjects. The latter was created as part of the study. Among the factors that influenced the performance of CNNeeg1-1 are: the preprocessing stage based on the selection of IMFs calculated with the APIT-MEMD algorithm, together with the difference of the FFTs between electrodes; in the case of BD2 database, the location of the electrodes over the sensorimotor interface area and articulatory network of the left hemisphere based on the Hickok & Poeppel model; and the proprietary architecture of CNNeeg1-1, which uses 10 CNNs specialized in the recognition of imagined vowel pairs, feeding a one-against-one block.
Additionally, the performance of the CNNeeg1-1 algorithm was compared with two DL reference algorithms, Shallow CNN and EEGNet, using both databases. Statistical results were presented with a mixed repeated-measures analysis of variance for intra-subject and inter-subject training. The results show that CNNeeg1-1 outperforms both Shallow CNN and EEGNet for EEG imagined vowel classification in intra-subject and inter-subject training analysis with both databases. Thus, it is shown that it is possible to classify imagined vowels with the new CNNeeg1-1 algorithm.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of the School of Medicine at Universidad Nacional de Colombia-Bogota, as stated in Evaluation Act 008-125-17 of 25 May 2017 issued by that Committee.
Informed Consent Statement: All subjects gave their informed consent for inclusion before they participated in the study.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. (Shallow CNN Architecture)
The implementation of this algorithm was performed with TensorFlow and Keras, and it was trained on the GPU described in Section 3.2.1. The Adam (Adaptive Moment Estimation) optimizer was used to train this network. The learning rate was set to 0.001 and the number of epochs was 60; 70% of the data was used for training and 30% for validation. The architecture of the corresponding CNN is described below.
APIT-MEMD was selected for the preprocessing of the signals for this architecture, due to the nonlinear and non-stationary characteristics of the signals. The first two IMFs resulting from APIT-MEMD (IMF1, IMF2) were chosen. These two IMFs were added up for each one of the 14 electrodes. With the signals obtained from APIT-MEMD, a transformation between electrode pairs was performed according to the equation $\mathrm{abs}\left(\mathrm{FFT}(E_i) - \mathrm{FFT}(E_j)\right)$, where $E_i$ and $E_j$ represent each electrode, with $i = 1, \ldots, 14$, $j = 1, \ldots, 14$, and $j > i$ (Figure A1).
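The pairwise transformation can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' implementation: the input is synthetic noise standing in for the IMF1 + IMF2 sum of each electrode, the function name is hypothetical, keeping the first 32 frequency bins is an assumption made to match the 32 columns of the input tensors, and the absolute value is taken of the complex difference, which is one possible reading of the equation.

```python
import numpy as np
from itertools import combinations


def pairwise_fft_difference(signals, n_bins=32):
    """Return abs(FFT(E_i) - FFT(E_j)) for every electrode pair with j > i.

    signals: array of shape (n_electrodes, n_samples), e.g. IMF1 + IMF2 per electrode.
    n_bins: number of leading frequency bins kept (assumed value).
    """
    spectra = np.fft.rfft(signals, axis=1)[:, :n_bins]
    rows = [np.abs(spectra[i] - spectra[j])
            for i, j in combinations(range(signals.shape[0]), 2)]
    return np.stack(rows)


rng = np.random.default_rng(0)
bd2_like = rng.standard_normal((14, 128))  # synthetic stand-in for 14 electrodes
bd1_like = rng.standard_normal((6, 128))   # synthetic stand-in for 6 electrodes
print(pairwise_fft_difference(bd2_like).shape)
print(pairwise_fft_difference(bd1_like).shape)
```

With 14 electrodes there are 91 pairs with j > i, and with 6 electrodes there are 15, which is consistent with the 91 × 32 × 1 and 15 × 32 × 1 input tensors used for BD2 and BD1.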
The input layer for the Shallow CNN consists of a tensor of size 15 × 32 × 1 for database BD1 and 91 × 32 × 1 for database BD2. This layer receives the image information obtained from the EEG imagined vowels. The conv2d layer is a 2D convolutional layer that applies a sliding temporal convolution filter on the input. For this layer, 40 filters of size 1 × 13 are used and the dimension of the output is 6 × 52 × 40. The next layer (conv2d_1) is a 2D convolutional layer that applies a sliding spatial convolution filter on the input (Figure A1). For this layer, 40 filters of size 6 × 1 are used, producing an output of size 1 × 52 × 40. In the batch_normalization layer, batch normalization is applied to the 40 input channels of the layer to improve the training of the convolutional networks and reduce their sensitivity to network initialization. In the activation layer, the LeakyReLU function with α = 0.1 is applied. The average_pooling2d layer performs average pooling, a downsampling that divides the input into rectangular regions and calculates the average value of each region. The size of each region was 1 × 35, with a stride of 1 × 7; thus, the output has a size of 1 × 3 × 40. Subsequently, the activation_1 layer applies the LeakyReLU function with α = 0.1. The dropout layer randomly applies, for each input image, a mask that sets 25% of the elements to zero to minimize overfitting effects (Figure A1). In the flatten layer, the output is flattened to a size of 120 and, finally, the softmax function is applied, with the cross-entropy loss calculated for the 5 classes of imagined vowels (/a/,/e/,/i/,/o/,/u/).
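The layer sizes quoted above can be checked with the usual output-length rule for unpadded windows, floor((size − kernel) / stride) + 1. In the sketch below, the starting size of 6 × 64 is an assumption chosen because it reproduces the quoted intermediate outputs; it is not the input tensor size stated in the text, and the helper name is hypothetical.

```python
def valid_output_length(size, kernel, stride=1):
    """Output length of an unpadded ('valid') convolution or pooling window."""
    return (size - kernel) // stride + 1


# Starting size assumed to be 6 rows x 64 columns (see the note above)
height, width = 6, 64

width = valid_output_length(width, 13)       # temporal convolution, 1 x 13
temporal = (height, width, 40)
height = valid_output_length(height, 6)      # spatial convolution, 6 x 1
spatial = (height, width, 40)
width = valid_output_length(width, 35, 7)    # average pooling, 1 x 35, stride 1 x 7
pooled = (height, width, 40)
flattened = pooled[0] * pooled[1] * pooled[2]

print(temporal, spatial, pooled, flattened)
```

Under that assumption the trace gives 6 × 52 × 40, 1 × 52 × 40, 1 × 3 × 40, and a flattened size of 120, matching the values in the description.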

Appendix B. (EEGNet Architecture)
The implementation of this algorithm was performed with TensorFlow and Keras, and it was trained on the GPU described in Section 3.2.1. The training of the EEGNet CNN used the Adam (Adaptive Moment Estimation) optimizer. The learning rate was set to 0.001 and the number of epochs was 80; 70% of the data was used for training and 30% for validation. The architecture of the corresponding CNN is described below.
The preprocessing of the signals for this architecture uses the APIT-MEMD method and selects the first two IMFs (IMF1, IMF2). The chosen IMFs are added up for each one of the 14 electrodes. The outputs of APIT-MEMD are processed with a transformation between electrode pairs according to the equation $\mathrm{abs}\left(\mathrm{FFT}(E_i) - \mathrm{FFT}(E_j)\right)$, where $E_i$ and $E_j$ represent each electrode, with $i = 1, \ldots, 14$, $j = 1, \ldots, 14$, and $j > i$ (Figure A2).
The input layer for the EEGNet consists of a tensor of size 15 × 32 × 1 for database BD1 and 91 × 32 × 1 for database BD2. This layer receives the images obtained from the EEG imagined vowels. The conv2d_1 layer consists of a 2D convolutional layer that applies a sliding temporal convolution filter to the input. For this layer, 8 filters of size 1 × 64 are configured and the output has a dimension of 6 × 64 × 8. In the batch_normalization_3 layer, batch normalization is applied to improve the training of the convolutional networks and reduce sensitivity to network initialization. The next layer, depthwise_conv2d_1, consists of a DepthwiseConv2D convolutional layer that applies a depthwise 2D convolution sliding filter on the input. For this layer, a size of 6 × 1 is configured, producing an output of 1 × 64 × 16. Next, the batch_normalization_4 layer applies batch normalization to improve the training of the convolutional networks (Figure A2). In the activation_2 layer, we apply the ELU function, where $f(x) = x$ for $x > 0$ and $f(x) = \alpha\,(e^{x} - 1)$ for $x \le 0$. The hyperparameter α controls the value at which the function saturates for negative layer inputs, and it diminishes the vanishing gradient effect. In this case, α = 1 is selected according to [17,62]. Next, the average_pooling2d_2 layer corresponds to an average pooling layer that performs a downsampling, dividing the input into rectangular regions and calculating the average value of each region. The size of each region was 1 × 4, so the output has a size of 1 × 16 × 16. The dropout_2 layer randomly applies, for each input image, a mask with 25% of its elements set to zero to minimize overfitting effects. The next layer, separable_conv2d_1, applies a depthwise separable 2D convolution sliding filter on the input. For this layer, 16 filters of size 1 × 16 are configured, producing an output of 1 × 16 × 16 (Figure A2).
Subsequently, in the batch_normalization_5 layer batch normalization is applied, followed by the activation_3 layer, which applies the ELU function with α = 1. Next, the average_pooling2d_3 layer calculates the average value of each region. In this case, the size of each region was 1 × 8; thus, the output has a size of 1 × 2 × 16 (Figure A2).
Next, the dropout layer randomly applies, for each input image, a mask with 50% of its elements set to zero to minimize overfitting effects. In the flatten layer, the output is flattened to a size of 32 and, finally, the softmax function is applied, with the cross-entropy loss calculated for the 5 classes of imagined vowels (/a/,/e/,/i/,/o/,/u/) (Figure A2).
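The final classification step can be illustrated with a short sketch of softmax followed by cross-entropy for the five vowel classes. The class scores below are hypothetical values chosen for illustration, and the function names are not taken from the original implementation; in practice this computation is handled by the DL framework.

```python
import math

VOWELS = ["a", "e", "i", "o", "u"]


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_index])


logits = [0.2, 1.5, 0.1, -0.3, 0.4]  # hypothetical scores for one trial
probs = softmax(logits)
predicted = VOWELS[probs.index(max(probs))]
loss = cross_entropy(probs, VOWELS.index("e"))
print(predicted, round(sum(probs), 6), loss)
```

The softmax output gives one probability per vowel, and the cross-entropy loss used during training decreases as the probability assigned to the true vowel approaches 1.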