Brain–computer interfaces (BCIs) are movement-independent communication devices, enabling users to interact with a computer through brain activity alone [1]. Traditional approaches to BCIs have harnessed brain activity corresponding to mental states such as motor imagery (MI) (e.g., [2]) or the steady-state visually-evoked potential (SSVEP) (e.g., [3]) and have been utilized for applications ranging from gaming (see [4] for a review) to stroke rehabilitation [5]. However, there is a growing body of research into the development of a direct-speech BCI (DS-BCI), utilizing imagined speech as a communicative modality [6]. Imagined speech is the internal pronunciation of phonemes, words or sentences, independent of movement and without any audible output [7]. To date, it has been considered in relatively few BCI studies, as the field has favored MI, P300 or SSVEP paradigms (see [8] for a review). However, a DS-BCI offers the possibility of a more naturalistic form of communication [9], relying on neural recordings corresponding to units of language rather than to some unrelated brain activity [12]. One recent study demonstrated the potential for spoken sentences to be synthesized from neural activity [14], and another showed speech reconstruction directly from the auditory cortex while subjects listened to overt speech [15]. Non-invasive studies using magnetoencephalography [16] and EEG [17] have demonstrated the potential for decoding speech with these technologies, and several studies have exploited the advantages of non-invasive EEG recording to investigate imagined speech as a communicative paradigm for BCIs (e.g., [17]).
Traditional approaches to BCIs require feature extraction and classification algorithms designed to decode a specific control signal. EEG signals are non-linear and non-stationary, and therefore highly complex [24], with classification typically achieved using features selected from some form of non-linear analysis together with a machine learning (ML) algorithm. Most imagined speech EEG studies have used traditional BCI approaches to feature extraction and classification. Common spatial patterns (CSP) [25], autoregressive coefficients [26], spectro-temporal features [23], Riemannian manifold features [17] and Mel Frequency Cepstral Coefficients (MFCC) [18] are among the features used to represent imagined speech EEG. As with feature extraction, imagined speech decoding has relied on several classification approaches typical of BCIs. The Support Vector Machine (SVM) has been the most common approach to date [18], with reported accuracies in the range of 77.5%–100% in binary vowel classification tasks [29] and an average accuracy of 92.46% in a two-class, meaning-based imagined speech task [20]. Linear Discriminant Analysis (LDA) has also been used for DS-BCI decoding [27], with one study reporting accuracies in the range of 66.4%–76.0% in two-class imagined phoneme production tasks [31]. Other classifiers applied to imagined speech include Naïve Bayes [27], k-Nearest Neighbors [18], Random Forests (RdF) [30], Relevance Vector Machines [18], and Sparse Logistic Regression [22].
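To make the shape of such a traditional pipeline concrete, the following is a minimal sketch of a CSP + SVM decoder built with MNE-Python and scikit-learn; the variable names, data shapes and parameter values are illustrative assumptions rather than the pipeline used in any of the cited studies.

```python
import numpy as np
from mne.decoding import CSP
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Illustrative imagined speech EEG: 200 epochs, 6 channels, 512 samples each
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6, 512))   # (n_epochs, n_channels, n_times)
y = rng.integers(0, 2, size=200)         # binary labels (e.g., two imagined words)

# CSP learns spatial filters that maximize variance differences between classes;
# the log-variance of the filtered signals is passed to an SVM classifier.
clf = Pipeline([
    ("csp", CSP(n_components=4, log=True)),
    ("svm", SVC(kernel="rbf", C=1.0)),
])

scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

A non-linear kernel SVM on spatially filtered log-variance features is representative of the hand-engineered feature-plus-classifier approach described above.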
Deep learning (DL) techniques such as convolutional neural networks (CNNs), recurrent neural networks (RNNs) and others have been successful in several fields of research, including computer vision [32] and automatic speech recognition (ASR) [33]. More recently, researchers have begun applying DL methods to BCI decoding challenges, and to EEG analysis in general. One of the virtues of DL is that the feature extraction procedure and the classifier are optimized in tandem, enabling a CNN, for example, to learn both simultaneously [34]. DL has been applied across the spectrum of BCI and EEG paradigms, including MI [35] and epilepsy detection [24]. Of all DL methods, CNNs are the most common approach to EEG tasks, having been used in 40% of studies [36]. Important works in the field include the use of CNNs for SSVEP classification [37], P300 detection [38] and classification of mental workload from EEG using time-frequency transforms [39]. Recently, deep transfer learning with CNNs has been used for EEG-based BCI applications [40]. See [36] for a systematic review of EEG-based applications of DL.
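As an illustration of how a CNN can learn the feature extraction and classification stages jointly from raw EEG epochs, the sketch below defines a small PyTorch network with a temporal convolution followed by a spatial (across-channel) convolution, loosely inspired by the shallow ConvNet and EEGNet family of designs; the layer sizes, channel count and class count are assumptions for illustration, not the architectures evaluated in this study.

```python
import torch
import torch.nn as nn

class TinyEEGNet(nn.Module):
    """Minimal temporal + spatial convolution CNN for EEG epochs."""

    def __init__(self, n_channels: int = 6, n_classes: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            # Temporal convolution: learns frequency-like filters per channel
            nn.Conv2d(1, 16, kernel_size=(1, 25), padding=(0, 12)),
            # Spatial convolution: learns weightings across electrodes
            nn.Conv2d(16, 16, kernel_size=(n_channels, 1)),
            nn.BatchNorm2d(16),
            nn.ELU(),
            nn.AvgPool2d(kernel_size=(1, 8)),
            nn.Dropout(0.5),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),  # collapse the remaining time dimension
            nn.Flatten(),
            nn.Linear(16, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_channels, n_times)
        return self.classifier(self.features(x))

# Forward pass on a dummy batch of 8 epochs (6 channels, 512 samples)
model = TinyEEGNet()
logits = model(torch.randn(8, 1, 6, 512))
print(logits.shape)  # torch.Size([8, 6])
```

Because the convolutional filters are learned directly from the data, no separate feature extraction step (such as CSP or spectral features) is specified by hand.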
Although a multilayer perceptron (MLP) ANN has been used to classify imagined speech with 63.2% mean accuracy in a yes vs. no task [21], and CNNs have been used to classify imagined speech in a multi-dataset study [42], there have been relatively few studies applying any form of DL in this area [43]. Indeed, it is still unclear whether DL methods provide consistent performance improvements over traditional ML approaches for EEG data [36]. Furthermore, despite recent studies implementing nested cross-validation (nCV) for hyperparameter (HP) optimization [45], robust consideration of HP selection has been severely lacking in the DL-EEG literature, with almost 80% of studies not mentioning HP searching at all [36]. Of the 21% of DL-EEG studies that did consider HP optimization, the majority applied trial and error or grid search. Statistical analyses and effective reporting of results were almost completely absent from those studies that did implement some form of HP optimization. Sixteen studies have recently been highlighted by two topic-specific review papers for their use of HP optimization strategies [36]. However, further analysis of these studies indicates that it is almost impossible to infer any meaningful information on valid HPs for DL-EEG. Furthermore, there is virtually no comparison of the effects of different HP values, no discussion of the interaction between HPs and different models, and no statistical analysis of these effects.
Of the reviewed studies, nine considered CNNs in at least one form. Methods vary across these studies and include grid search [47], Bayesian methods [49] (one fails to report the specific approach [49]), trial and error [24], and unstated approaches likely indicating trial and error [53]. In six of these studies, only partial results are reported in relation to HPs [24]. For example, one otherwise excellent paper [50] presents only the optimal values for the structural parameters tested and does not report on the effects of optimizing the learning rate and learning rate decay. There are similar omissions in relation to HP optimization in each of the six studies cited above. In the remaining three CNN studies highlighted by the review papers, no results on HP optimization are presented at all [47]. Not one of these studies presents any analysis of observed differences between HPs or their interaction with a given model, and none report any statistical analysis of results pertaining to HP optimization. In almost all of these papers, perhaps excluding [50], reporting of the approach to HP optimization is poor. Additionally, there are no interpretable results, no comparison of individual HPs or of interactions between HPs, and no statistical analysis of anything relating to HPs. Other studies cited in the reviews [36], although employing DL models other than CNNs, follow this pattern of inadequate reporting of HP optimization.
It is the above points regarding the relative performance of DL and ML methods, and the lack of attention to HP optimization, that have informed the aims and methodology used here. This study had three main aims. First, we wanted to determine whether DL methods provided a significant improvement over benchmark ML approaches. Second, we wanted to undertake a robust, statistically validated analysis of HP optimization that would make explicit which HPs were most effective. Third, we wished to investigate whether subject-specific HPs were always necessary, or whether generalization of HPs across subjects was feasible without significant loss of performance.
We use an EEG dataset recorded while participants performed imagined speech tasks in Spanish, and perform classification using three different CNNs. The performance of the CNNs is compared to three traditional ML benchmark classifiers. The CNNs were selected for this study because each has been specifically designed for EEG applications [34]. A nested approach to cross-validation, more commonly applied to standard ML problems, is here applied in a DL context. The nCV method facilitates HP optimization for all classifiers. Statistical analysis of the relative effects of the HP values on the CNNs' performance, and of the interactions between the HPs and the CNNs, provides important information for future approaches to decoding imagined speech, and EEG in general. Two modes of HP optimization are tested. The first is an intra-subject approach, in which subject-specific HPs are selected. The second is an inter-subject approach, in which a single set of HPs is selected for all subjects.
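The following is a minimal sketch of the nested cross-validation structure described above: an inner loop selects the HP combination with the highest mean validation accuracy, and an outer loop estimates test accuracy after retraining with the selected HPs on the full outer training fold. The `fit_and_score` callback standing in for CNN training, the grid values and the fold counts are illustrative assumptions, not the exact configuration used in this study.

```python
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative HP grid (values assumed for the sketch)
HP_GRID = {
    "activation": ["relu", "leaky_relu", "elu"],
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "epochs": [20, 40, 60, 80],
}

def fit_and_score(hp, X_train, y_train, X_test, y_test):
    """Placeholder: build a CNN with the given HPs, train it on the
    training split and return accuracy on the test split."""
    raise NotImplementedError

def nested_cv(X, y, n_outer=5, n_inner=4):
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=0)
    combos = [dict(zip(HP_GRID, v)) for v in itertools.product(*HP_GRID.values())]
    outer_scores = []
    for tr_idx, te_idx in outer.split(X, y):
        X_tr, y_tr, X_te, y_te = X[tr_idx], y[tr_idx], X[te_idx], y[te_idx]
        inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=1)
        # Inner loop: mean validation accuracy for every HP combination
        mean_scores = []
        for hp in combos:
            scores = [
                fit_and_score(hp, X_tr[i], y_tr[i], X_tr[v], y_tr[v])
                for i, v in inner.split(X_tr, y_tr)
            ]
            mean_scores.append(np.mean(scores))
        best_hp = combos[int(np.argmax(mean_scores))]  # joint selection over the grid
        # Outer loop: retrain on the full outer training fold, test once
        outer_scores.append(fit_and_score(best_hp, X_tr, y_tr, X_te, y_te))
    return np.mean(outer_scores), outer_scores
```

For the inter-subject mode, the inner-loop mean scores could, for example, be pooled across subjects before the best combination is chosen, so that a single HP set is shared by all subjects; in the intra-subject mode the selection is made separately per subject.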
This paper presents the methodology, results and discussion of this study, in which we evaluate the relative performance of ML and DL approaches to imagined speech decoding, and the significance of HP optimization in relation to several different algorithms.
The study presented here investigates the effects of varying the HPs used to construct and train multiple CNNs, and the impact of optimizing HPs using intra- and inter-subject methodologies. Furthermore, the overall classification accuracies of imagined words and vowels obtained using the CNNs were compared to those of three traditional ML approaches. Results obtained from the nCV approach to HP optimization indicate the importance of HP selection when implementing CNNs. For all HPs but the loss function, varying the HP values resulted in effects that were significant both across the different HP values and across the different CNN architectures. Unsurprisingly, given its current prominence as a non-linear activation function for CNNs, leaky ReLU achieved the highest mean validation accuracy scores (Figure 3). This result was reflected in the selection of leaky ReLU in most intra-subject cases and all inter-subject cases for the shallow and deep CNNs. However, ELU was the optimal activation function for the EEGNet.
Smaller learning rates (0.001–0.1) achieved the best results with both the shallow and deep CNNs, but a swift drop-off was apparent when the learning rate was increased to 1. In contrast, the EEGNet performed best with a learning rate of 1, although it is unclear whether this trend would continue if the learning rate were increased further. While the accuracies obtained by the EEGNet increase only in small increments after 20 epochs (indicating that less training is required for this network), 80 epochs were nevertheless selected for its inter-subject training. In contrast, the shallow and deep CNNs make statistically significant leaps when trained for 40 epochs (p < 1 × 10⁻⁸) and 60 epochs (p < 1 × 10⁻⁶⁰), respectively. This is reflected in the results of inter-subject HP optimization, where the number of epochs selected for these networks is consistently 60 (Table 2).
Taken together, these results highlight the importance of selecting reasonable HPs, and the selection of learning rates and activation functions indicates that these two HPs are critical to performance (Table 2). However, the results also indicate how critical the interactions among HPs, and between HPs and the network architecture, are. These interactions are not always considered when a CNN is implemented for classification tasks [36]. The importance of interactions between HPs within a given network is evidenced by the selection of a learning rate for the shallow and deep CNNs. A learning rate of 0.001 clearly obtains the highest inner-fold validation accuracy when evaluated independently of the other HPs (Figure 4c). However, as the selection of HPs for final model training is based on the interaction between different HP combinations, a different learning rate (0.1) was selected as optimal (Table 2). The reason for this apparently incongruous result is that the leaky ReLU activation function performed better with a learning rate of 0.1 than with 0.01, whereas the learning rate of 0.01 performed best when evaluated across all activation functions. That is, the HPs selected (Table 2) are based on validation accuracy as a function of the full set (activation function, learning rate, loss and number of epochs), rather than on individual maximum values. The criticality of interactions between the HPs and the specific CNNs is indicated by the differential effects of activation functions, learning rates and number of epochs on the three CNNs. The dramatic differences between the shallow and deep CNNs on the one hand, and the EEGNet on the other, are particularly visible in Figure 4c,d, where the EEGNet's optimal learning rate is 1 and it is able to achieve similar accuracies after 20 and 80 epochs. It is also clear from Figure 5 and Table 2 that the different CNNs respond differently to the various HPs tested.
The results in Section 5 indicate that optimization of the HPs evaluated for the SVM, RdF and rLDA resulted in differences that were either not significant at all or only significant with 0.01 < p < 0.05. In comparison, the HPs tested with the CNNs were all highly significant, with the exception of the loss functions. These results suggest that, although time-intensive, HP optimization is of greater importance for CNN approaches than it is for traditional ML methods.
In this work, we have detailed the HPs that worked best for each of the CNNs when training on imagined speech EEG data (Table 2, Figure 5). We have also presented future users of these networks with effective HPs in relation to activation function, learning rate and number of training epochs. Along with previous works [34], this should provide researchers with a reasonable benchmark from which to approach other classification problems with these CNNs. The nCV technique applied here is not often applied in DL contexts [36], but due to the relatively small quantity of training data available, it was necessary to use this method to enhance model robustness.
The use of an inter-subject or intra-subject method for selecting HPs was not significant for either words (F(1,14) = 0.151, p = 0.699) or vowels (F(1,14) = 0.626, p = 0.432). Although HP optimization has clearly been shown to be a significant factor in the performance of the CNNs, it is not necessarily important for these HPs to be optimized on a subject-specific basis. High inter-subject variability has been a limiting factor for EEG applications [36], so the fact that the final accuracies could be achieved with a global selection of HPs could have important consequences for inter-subject training.
In each case, the CNN architectures achieved higher classification accuracies than the benchmark classifiers, and these results were significant (p < 1 × 10⁻⁷). The range of improvement the CNNs achieved relative to the benchmarks (words: 3.94–6.54%; vowels: 3.78–8.02%) is consistent with the reported 5.4% median improvement attributed to DL approaches over traditional ML methods for EEG classification [36]. The range of accuracies achieved by the final models (words: 24.35–24.90%; vowels: 28.95–30.25%) indicates that the different CNN architectures are similarly capable of decoding imagined speech EEG. This is despite the significant differences in the HPs required to optimize the different networks, and their different inner-fold performance on the validation set. It should also be noted that the CNNs have vastly different numbers of trainable parameters (Supplementary Table S1), with the compact EEGNet much less computationally complex than either the shallow or deep CNNs; this is an important factor in interpreting overall performance. In fact, complexity is a key aspect in determining the overall efficacy of a model, as it comes with a cost in terms of the training time required. Due to the large number of trainable parameters in CNNs, traditional machine learning approaches have been shown to be less time-consuming [71]. However, this work has demonstrated that CNNs provide a statistically significant advance on traditional approaches. This, and the fact that making predictions with trained CNNs (with GPUs enabling fast inference) does not differ greatly in cost from traditional methods, supports the use of CNNs for imagined speech decoding. Taken together, the results support the claims of the authors of the original papers that these CNNs can generalize well across BCI paradigms [34] and indicate the potential of DL methods for decoding imagined speech EEG.
Regarding the classification results, it must be made clear that the observed performances of the models are not at the level that would be required for a working DS-BCI. For such an important mode of communication, it is imperative that users be supplied with a highly accurate and robust system. Despite recent advances [10] and high levels of interest, the field is not yet at that stage. Several avenues for investigation are available to researchers seeking to make progress in this regard. These include systematic improvements such as increasing the number of trials and the number of recording channels used. Additionally, experimental improvements can be sought through investigation of the neurolinguistic properties of parts of speech and the impact of stimulus presentation methods. However, the findings presented here compare favorably with other studies in the field. Many previous studies have applied binary classification paradigms to the study of imagined words, with average accuracies of 58% [18], 80.05% [17] and 92.46% [20] demonstrating the potential for decoding imagined speech EEG. A mean accuracy of 50.06% in a 3-class classification task involving short imagined words has also been reported [17]. In this work, we have demonstrated the potential to decode a greater range of words from imagined speech EEG by classifying six words with a mean accuracy of 24.90% (chance: 16.67%) for the EEGNet and a highest single-subject mean of 30.36% (subject 13, deep CNN).
Imagined vowel classification has previously been shown to be feasible, with results in the range of 48.96% for a 3-class task [17] and 77.5–100% for binary classifiers [29]. Here, we have shown that classification of multiple imagined vowels is possible with a mean accuracy of 30.25% (chance: 20%) and a highest single-subject mean of 35.83% (subject 7, shallow CNN), providing evidence that linguistic units below the level of the word or sentence, i.e., phonemes, can be distinguished from EEG. However, the promise indicated by these results must be tempered by acknowledgement of the limitations of the study. First among these is the size of the imagined speech dataset and the relatively small number of trials per class (maximum: 40 [56]). The CNNs were reasonably robust given the number of training samples afforded by the dataset. Although the test accuracy tended to be lower than the training accuracy, possibly indicating some overfitting, this is unsurprising given the small sample size, which makes it difficult for CNNs to generalize. It is well known that more data typically improves the performance of a CNN, and we agree with the recommendations in [71], where increasing sample sizes are advocated as a means of ameliorating the large variance and errors associated with cross-validation approaches and small datasets. The second limitation is the small number of EEG channels used in data acquisition. Six electrodes is a small number and restricts the spatial resolution of the EEG data. Given that CNNs work to capture spatial dependencies in the data, the shortage of electrodes likely impacted negatively on overall performance. Other studies related to imagined speech decoding have typically used montages ranging from 16 [30] to 64 [17] electrodes, thus providing greater information to decoding models. If a DS-BCI is to become a functional technology, full consideration of each of these issues is essential. It must also be noted that any additional complexity arising from the use of deeper neural network architectures, and potentially longer training times, must be accounted for in the development of real-time BCIs using these models. It can be observed in Figure 5c that optimizing for accuracy often resulted in the selection of 80 training epochs. However, the additional training time required may make the total benefit of this optimization negligible in a real-time scenario.
A weakness of the present study is that the experimental paradigm does not facilitate any analysis of the differential effects of the different stimuli, as the prompts are presented concurrently. Additionally, the paradigm has not accounted for possible effects of user training on BCI performance. Subject training, in the form of task repetition and learning through neurofeedback (e.g., [72]), has been shown to improve the performance of BCIs using other paradigms. It is likely that similar improvements to DS-BCI performance can be obtained through effective training strategies aimed at enhancing a user's ability to interact with a given system. Although the majority of DS-BCI studies have relied on single-session protocols [23], others have used multiple sessions for data collection [17]. Studies have demonstrated classification of imagined speech EEG with as few as 30, 40 or 50 trials per class [25]. However, higher-volume data are required to fully exploit DL models such as CNNs. Multi-session protocols offer several advantages, including user training, collection of a greater number of trials per class, and analysis of multiple paradigms. For example, using up to three recording sessions, the authors of [17] were able to collect 100 trials per class while also investigating the effects of using vowels, short words and long words. Use of multi-session training and neurofeedback within future experimental protocols is required to ascertain the impact of these strategies.