4.1. Isolated ‘Speech’ vs. ‘Non-Speech’ Classification
Human brain activity during speech and silence has been studied extensively in past decades to obtain a better understanding of the neural mechanisms of speech production [32]. However, the ‘speech’ tasks in many of these studies were designed to evaluate either non-speech acoustic production or various oro-motor tasks that may not represent speech production in its entirety. Little is known about neural behavior in the pre-speech and post-speech intervals of overt speech production. The pre-speech neural data encompass both the reaction time and the speech-motor initiation that precede acoustic output. Hence, brain activity during the ‘Pre-Speech’ and ‘Speech’ segments might overlap substantially, with auditory feedback added during the ‘Speech’ segment. Similarly, for ‘Post-Speech’, the auditory and language-related cortices continue processing information well after the speech acoustics terminate [34]. Determining the beginning and end of articulation solely from brain activity may therefore be challenging. In this study, an SVM classifier was tasked with these objectives.
Figure 4 shows the isolated classification results obtained with the SVM classifier, averaged over 8 subjects. A 90% classification accuracy was observed between the ‘Speech’ and ‘Non-Speech’ (‘Pre-Speech’ + ‘Post-Speech’) classes. This result provides compelling evidence that differences in brain signals between speech and silent intervals can be detected and used for automated identification of speech onset and offset, motivating the development of a NeuroVAD algorithm. Additional classifications were performed by separating the ‘Pre-Speech’ and ‘Post-Speech’ segments into their own classes. Results for all classification combinations are shown in Figure 4. A 3-class classification over all segments resulted in approximately 85% accuracy, indicating that two of the three classes contained redundant information. When classifying ‘Speech’ against ‘Pre-’ or ‘Post-Speech’ individually, ‘Speech’ vs. ‘Post-Speech’ yielded the higher accuracy, again highlighting the difference between the ‘Pre-’ and ‘Post-Speech’ segments. Considering the feature sets for the LSTM-RNN model, there is considerable overlap of low-frequency-modulation feature shapes across the ‘Pre-Speech’ and ‘Speech’ segments, whereas the ‘Post-Speech’ segment demonstrates fairly low variance. From our movement data we know that oromotor movement precedes acoustic onset, which may explain this. Similarly, Feature 4 appears to mirror this differentiation, with the dominant sensor location being more stable over the low-frequency modulation segments of ‘Pre-Speech’ and ‘Speech’ and more variable as these subside in ‘Post-Speech’. This could mean that Feature 4 tracks the low-dimensional transitions of brain network formation for specific tasks. While this observation invites interesting speculation, the statistical difference between these decoding schemes was not significant (‘Pre-Speech’ vs. ‘Speech’ compared with ‘Post-Speech’ vs. ‘Speech’; p = 0.164; one-tailed t-test; 95% CI). This may be due to one outlier subject, the only one to show better ‘Pre-Speech’ vs. ‘Speech’ classification, and indicates that the study is most likely underpowered for such statistical comparisons. Regardless, the high classification accuracy (>89%) of the speech interval compared to ‘Non-Speech’ (both ‘Pre-Speech’ and ‘Post-Speech’ segments) further boosts confidence for designing an automated NeuroVAD system.
4.2. Real-Time NeuroVAD
The result of the automatic neural voice activity detection experiments can be observed in Figure 5, which shows an exemplary prediction result plotted over the corresponding speech data. The speech signal in Figure 5a was recorded while subject 8 was speaking “That’s perfect”. The red dotted line indicates the ground-truth labeling and the blue dotted lines indicate the prediction by the LSTM-RNN. Since the objective was only to find the start and end points of a phrase, the silences within a phrase were not designed to be predicted. The predicted start time points were slightly earlier (by a mean of about 10 ms) than the truth (labels), which was common for most of the test trials; however, no statistical significance was found for this result (left-tailed t-test, p = 0.068). Despite the prediction error and the theoretical decrease in computational accuracy, one interpretation of this behavior is that the neural information corresponding to speech might occur before the actual speech production, as illustrated in Figure 3, where an increase in the magnitude of feature values above the baseline can be observed slightly before speech onset. Future analysis, however, is needed to confirm this interpretation. Nevertheless, the predictions are accurate, indicating that the start and end time points are determinable from the neural signals alone.
The overall performance results of the NeuroVAD experiments are shown in Figure 5b, which presents the accuracy of NeuroVAD at the single-trial level for each subject and their average. Here, accuracy = (TP + TN)/(TP + TN + FP + FN), where TP = true positive, FP = false positive, TN = true negative, and FN = false negative. The average accuracy was found to be 88.07% across 8 subjects. The classification accuracy reported in [17] was about 92%, using isolated classification to differentiate between ‘Speech’ and ‘Non-Speech’. This is similar to our method (see Figure 4), which classifies with about 90% accuracy. Going further, our result with the LSTM-RNN is for predicting ‘Speech’ labels at each time point (in real time), as is done by conventional VAD systems in the ASR community. Continuous prediction is crucial for speech-BCI applications aimed at conversational systems similar to Alexa or Siri, but driven by neural signals instead of speech signals for ALS patients. Although the accuracy obtained was promising, conventional VAD systems achieve near-perfect accuracy in current speech-assistant technologies; hence, more research is required to increase NeuroVAD performance. One possible solution is to use deep-learning feature extractors instead of the statistical features used in this study; however, an exponentially larger dataset would be required. It is very difficult to collect large amounts of neural data, particularly from the ALS population, as patients eventually tire during the task given the greater motor involvement needed to compensate for paralyzed muscles [35]. Thus, intelligent strategies such as data augmentation or artificial data synthesis and transfer-learning techniques might be needed for effective decoding. Also, the machine training strategy should be optimized such that training and validation are performed on data from one session and testing on data from the next session, making the algorithm more robust and invariant across sessions for ALS patients [36]. Of note, our data preprocessing step involved discarding artifact-biased trials through visual inspection, leading to a 25% rejection rate. This might be very costly in the low-data regime of the ALS population; thus, instead of trial rejection, data denoising strategies should be implemented [38].
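The sample-level accuracy defined above, together with the precision and recall reported later for the same predictions, follows the standard confusion-matrix definitions. A stdlib sketch with invented label sequences (not data from the study):

```python
def vad_metrics(truth, pred):
    """Accuracy, precision, and recall for binary VAD labels (1 = 'Speech')."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Illustrative only: one false positive just before the speech onset,
# the boundary-error pattern described in the text (recall > precision).
truth = [0, 0, 1, 1, 1, 1, 0, 0]
pred  = [0, 1, 1, 1, 1, 1, 0, 0]
print(vad_metrics(truth, pred))  # → (0.875, 0.8, 1.0)
```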
In Figure 5b, it can be observed that the accuracies are somewhat subject-specific, possibly due to the inclusion of all the sensors (whole head) in the feature set. It is challenging to determine the exact combination of sensors that will provide the best performance. Intersubject cognitive-performance variance might also have played a role in prediction variance. A statistical frequency (mode) analysis of the sensor indices across all 8 subjects yielded commonalities in sensors in the temporal regions bilaterally, with a bias toward the left hemisphere (Figure 6a–c). This is consistent with the traditional understanding of left-hemisphere dominance of speech production in neuroanatomy. While this analysis highlights common sensors across subjects, there is still large intersubject variability, as shown in Figure 6d. This variability may speak to the high level of inter-individual variation in neuro-cognitive network structure observed in neuroimaging experiments and reflects the necessity of subject-dependent decoding [39]. Nevertheless, the efficacy of this study lies in achieving high accuracy at the single-trial level with an average standard deviation of <10%. For all the subjects, the median was higher than the mean, with the overall distribution within 85–90%. However, there were a few trials for which the accuracy was not very satisfactory (70–80%). Since accuracy was calculated at the sample level of single trials, a 10% misclassification in one trial corresponds to about 200 misclassified samples. However, the recall of this study across all subjects (92.33%) was higher than the precision (90.92%), indicating fewer false negatives (actual ‘Speech’ predicted as ‘Non-Speech’) than false positives. It is also important to mention that most of the false positives occurred near the boundaries, where the NeuroVAD algorithm mispredicted as ‘Speech’ the ‘Non-Speech’-labeled samples lying just before the acoustic onset and just after the offset. Although this reduces the machine accuracy, it can be interpreted as the outcome of pre-cognitive planning and post-auditory processing in the brain for speech production, which might increase BCI performance if included.
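The frequency (mode) analysis of dominant sensor indices can be sketched with `collections.Counter`: pool the per-subject indices and rank them by occurrence. The subject keys and sensor indices below are invented for illustration and do not correspond to the actual MEG sensor layout:

```python
from collections import Counter

# Hypothetical dominant-sensor indices per subject (illustrative only).
dominant_sensors = {
    "S1": [12, 45, 12, 88],
    "S2": [12, 45, 91, 45],
    "S3": [45, 12, 12, 70],
}

# Pool indices across subjects and rank by frequency (mode analysis).
counts = Counter(idx for sensors in dominant_sensors.values() for idx in sensors)
print(counts.most_common(2))  # → [(12, 5), (45, 4)]
```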
4.3. Efficacy of LSTM-RNN for NeuroVAD
Regarding the performance of the LSTM-RNN, it is worth mentioning that it was capable of learning the sequential pattern from the features corresponding to speech and non-speech periods. One point of discussion concerns the features used for training the LSTM-RNN. The three statistical features used here, especially Feature 1 (sum of absolute values) and Feature 3 (standard deviation), were correlated. However, discarding any of these three features from the feature set resulted in about a 5% decrease in accuracy. To analyze this further, we plotted the difference between these two features (Features 1 and 3), as shown in Figure 7. Although Features 1 and 3 seem similar in Figure 3, they differ over time, and thus both features were required and contributed to NeuroVAD training. We also implemented various combinations of time-frequency representational features as proposed in [17], but the validation accuracy for NeuroVAD was optimal with the current 4 statistical features. Rigorous hyperparameter tuning was performed to find the best architecture for this objective. For example, around a 4% increase in accuracy was obtained when the number of layers increased from one to two, after which accuracy saturated with additional layers. Similarly, there was around a 3% increase in average accuracy when the number of nodes increased from 64 to 128 and then 256, beyond which accuracy saturated. The addition of another fully connected layer after the LSTM layer did not improve accuracy. Overall, the proposed architecture was found to provide the best performance in voice activity detection, and the LSTM-RNN was found to be efficient for this objective. As this NeuroVAD approach is constructed to learn features in the temporal domain, and considering that both EEG and ECoG reflect similar data types, a major contribution of this study is that the proposed approach can be translated to the EEG and ECoG modalities.