Deep learning-based classification of fine hand movements from low frequency EEG

The classification of different fine hand movements from EEG signals represents a relevant research challenge, e.g., in brain-computer interface applications for motor rehabilitation. Here, we analyzed two different datasets where fine hand movements (touch, grasp, palmar and lateral grasp) were performed in a self-paced modality. We trained and tested a newly proposed convolutional neural network (CNN), and we compared its classification performance with that of two well-established machine learning models, namely, a shrinkage LDA (sLDA) and a Random Forest. Compared to previous literature, we took advantage of knowledge from the neuroscience field, and we trained our CNN model on the so-called Movement Related Cortical Potentials (MRCPs). These are EEG amplitude modulations at low frequencies, i.e., (0.3, 3) Hz, that have been shown to encode several properties of the movements, e.g., type of grasp, force level and speed. We showed that the CNN achieved good performance in both datasets, similar or superior to that of the baseline models. Also, compared to the baselines, our CNN requires a lighter and faster pre-processing procedure, paving the way for its possible use in an online modality, e.g., for many brain-computer interface applications.


I. INTRODUCTION
The recognition and classification of different fine hand movements from electroencephalography (EEG) signals represents an interesting and challenging research question. Several Brain-Computer Interface (BCI) systems for motor rehabilitation [1]-[3] and other basic neuroscience studies, such as the investigation of the neural mechanisms underlying writing and music performance [4], [5], strongly rely on the ability to precisely and effectively distinguish different fine hand movements.
Movement Related Cortical Potentials (MRCPs) are amplitude modulations of the time-domain EEG signal that occur in the (0.5, 4) Hz frequency band [6]. MRCPs can be detected during motor execution and imagery, or in an attempted movement, and they reflect the cortical processes involved in the planning and execution of a movement. Previous literature [6] reports that the components of the MRCPs can be influenced by several factors, such as the preparatory state (self-paced or cue-based), the level of intention, the type of movement, the praxis and the previous experience of the same movement, besides the presence of any pathology of the brain structures. Nevertheless, it has been found [7] that MRCPs can also encode several properties of the movements, such as the type of grasp, the force level and the speed of the task. For this reason, MRCPs are considered valid signals for BCI control [7]. Here, we combined this knowledge from the neuroscience field with the potential of deep learning to improve the classification of touch, grasp, palmar and lateral grasp movements.
Previous literature has already investigated the classification of different fine hand movements, including touch and different kinds of grasp. The majority of the studies employed shrinkage linear discriminant analysis (sLDA), which is a well-established approach for EEG classification because of its low complexity and good performance even with a limited number of trials. However, Linear Discriminant Analysis (LDA) and its regularized version, sLDA, are linear classifiers, which might score poorly in the case of complex non-linear EEG data [8]. The aim of this work was to evaluate the performance of a newly proposed convolutional neural network (CNN) model, in comparison with two standard machine learning algorithms, namely sLDA and random forest (RF), in the classification of 3 different classes of movement, using two datasets.
The paper is organized as follows. In Section II, we present the most relevant previous studies related to our work. In Section III, we describe the experimental protocol, the pre-processing steps common to all models and our proposed CNN model, and we briefly review the two baseline models chosen as a comparison for the performance evaluation. In Section IV, we report and discuss all results, both from the qualitative analysis of the MRCPs and from the classification of the different movements. Finally, Section V concludes the paper, also mentioning the possible impact of this work on other studies.

II. RELATED WORKS
The possibility of decoding touch and grasp actions from low-frequency EEG signals has been shown in other studies [9]-[11]. In [9], Ofner et al. classified single upper limb movements with a binary classification approach, recording six different types of movements, both executed and imagined, and rest trials. For the executed movements, the average accuracy reached 87% in the movement versus rest binary classification, while it dropped to 55% in the movement versus movement classification. For the imagined movements, accuracies as low as 27% and 73% were obtained for the movement versus movement and the rest versus movement classification, respectively. In [10], palmar, lateral and pincer grasps were recorded and classified in a cue-based paradigm. A 4-class sLDA was used to classify the three movements and the rest data, obtaining a peak accuracy of 65.9%. Moreover, a binary classifier was trained in the same study for each binary combination of classes. The palmar versus lateral grasp classification obtained a peak accuracy of 73.5%. In [11], both unimanual and bimanual reach-and-grasp actions were classified with sLDA. Binary combinations of the different movements were also classified separately, leading to average accuracies for the movement classes between 66% and 70%. The highest accuracies were obtained for the rest class versus the movement classes, with performance between 74% and 90%. Recently, new approaches have been emerging. Deep learning has shown promising results in many different fields of application and has also been successfully applied in the BCI field [8].

III. METHODS
In this section, we first present the experimental protocol used to acquire the two datasets. Second, we describe the common pre-processing pipeline that is used by the CNN and the baseline models. Then, we introduce our CNN-based model and the baseline models used for the performance comparison. Finally, we explain the cross-validation procedure and the metric we used for the evaluation of the performance.

A. Experimental protocol
At the very beginning of the experimental protocol, the participants' handedness was tested with the well-known hand dominance test of [12]. Then, they were asked to sit on a comfortable chair in a noise- and electromagnetically shielded room. The brain activity was acquired via EEG by means of 4 g.USBamp amplifiers (g.tec medical engineering GmbH, Austria) and a 64-channel gel-based EEG cap (g.GAMMAsys/g.LADYbird, g.tec medical engineering GmbH, Austria). Specifically, 58 electrodes recorded the brain activity, while the remaining 6 were used to record the electrooculogram (EOG). The EEG electrodes' locations were defined by a well-established modified version of the International 10-20 System [13]. All data were recorded using a 256 Hz sampling frequency. In the resting position, the participants' right arm was placed, relaxed, upon a pressure button on a table in front of them. They were also instructed to avoid unnecessary body or eye movements, and to keep their gaze on a fixed point for a few seconds at the beginning of each repetition of the movement. All movements were self-initiated to ensure a more natural application scenario. Additionally, a 3 min rest period was recorded three times: at the beginning, in the middle and at the end of the experiment.
Experiment 1 - Touch and Grasp: In the first experiment, 11 healthy volunteers (age 20-38 years old, 11 M) were included. The hand dominance test resulted in 9 right-handed participants, 1 left-handed and 1 undefined. During the experiment, two glasses were on the table within the participant's reaching distance. Each glass was equipped with a pressure sensor in order to precisely detect the grasping onset. The participants were instructed either to grasp the first glass or to touch the second glass for a minimum time of 4 s. Thus, the total duration of each repetition was longer than 5 s. Four sessions of 20 repetitions of each movement, i.e., grasping and touching, were included in the protocol. Thus, 80 touching and 80 grasping movements had been performed by each participant by the end of the experiment. After each session, the participants could take a break and the glasses were switched. The same number of repetitions was performed in both glass positions. On the computer screen in front of them, the participants could see the remaining number of trials to perform.
Experiment 2 - Palmar and Lateral: In the second experiment, 15 right-handed participants were involved. During the experiment, two jars were on the table within the participant's reaching distance. The first one was empty, while the second had a spoon stuck in it. The participants were instructed to reach and grasp either the first or the second jar for a minimum time of 2 s. Thus, the total duration of each repetition was longer than 5 s. They freely decided which movement to perform. To interact with the empty jar, they had to perform a palmar grasp, while for the jar with the spoon they used a lateral grasp. Four sessions of 20 repetitions of each movement, i.e., palmar or lateral grasp, were included in the protocol. After each session, the participants could take a break and the objects were switched. The same number of repetitions was performed in both object positions.

B. Pre-processing
We adopted the same pre-processing pipeline for both EEG datasets used in this study. The pipeline is a well-established procedure, previously implemented in [10], [11]. The full data processing was implemented in Matlab 2020a [14]. First, every EEG signal was band-pass filtered between 0.01 Hz and 100 Hz (Chebyshev filter, order 8). Second, a notch filter was applied to suppress the power line noise at 50 Hz. Additionally, Independent Component Analysis (ICA) could be applied at this point to identify and remove the artifacts due to eye movements, as in [15]. Third, a narrower band-pass filter (Butterworth filter, order 4) was applied to extract the low-frequency component of the signal in the band (0.3, 3) Hz. All filters were applied with the Matlab function filtfilt (zero-phase filtering) in order to compensate for the delay they introduce. The full dataset, i.e., including all EEG signals, was transformed using the Common Average Reference (CAR) filter [16], a spatial filter used to enhance the signal component due to the brain region under each individual EEG sensor (i.e., discarding components that are spread all around the scalp). Finally, every signal was downsampled to 16 Hz (using the Matlab function resample).
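The filtering chain described above can be sketched in a few lines. The original processing was done in Matlab; the following is only an illustrative Python/SciPy approximation, not the authors' code. The Chebyshev type (type I), its pass-band ripple (0.5 dB) and the notch quality factor (30) are assumptions that the text does not specify, the optional ICA step is omitted, and `resample_poly` stands in for the Matlab function resample.

```python
import numpy as np
from scipy import signal

FS_RAW = 256  # Hz, acquisition sampling frequency reported in the text
FS_OUT = 16   # Hz, sampling frequency after downsampling


def preprocess(eeg, ripple_db=0.5, notch_q=30.0):
    """Filter, re-reference and downsample EEG of shape (n_channels, n_samples)."""
    # 1) Wide band-pass, 0.01-100 Hz, Chebyshev design of order 8
    #    (type I and the ripple value are assumptions).
    sos_wide = signal.cheby1(8, ripple_db, [0.01, 100.0], btype="bandpass",
                             fs=FS_RAW, output="sos")
    x = signal.sosfiltfilt(sos_wide, eeg, axis=-1)  # zero-phase, like filtfilt

    # 2) Notch at 50 Hz for power line noise (quality factor is an assumption).
    b, a = signal.iirnotch(50.0, notch_q, fs=FS_RAW)
    x = signal.filtfilt(b, a, x, axis=-1)

    # 3) Narrow band-pass, 0.3-3 Hz, Butterworth design of order 4.
    sos_low = signal.butter(4, [0.3, 3.0], btype="bandpass",
                            fs=FS_RAW, output="sos")
    x = signal.sosfiltfilt(sos_low, x, axis=-1)

    # 4) Common Average Reference: subtract the mean over channels
    #    at every time sample.
    x = x - x.mean(axis=0, keepdims=True)

    # 5) Downsample from 256 Hz to 16 Hz (factor 16).
    return signal.resample_poly(x, up=1, down=FS_RAW // FS_OUT, axis=-1)
```

Second-order sections are used here because a band edge as low as 0.01 Hz at a 256 Hz sampling rate places poles very close to the unit circle, where transfer-function coefficients tend to be numerically fragile.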
During the experimental sessions, a pressure sensor (either on the table or on the object to interact with, see Section III-A) was exploited to identify the time instants when the individual initiated the movement, i.e., the movement onset. This ensured a proper segmentation of the continuous pre-processed EEG signals. Each segment (or trial) was defined as the portion of the signal from −2 s to +3 s around each movement onset (i.e., time 0). Not only movement-related trials, but also 5 s rest trials were obtained from the datasets: they were extracted from the 3 min rest periods (see Section III-A).
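As an illustration of this segmentation step, the sketch below cuts fixed-length trials around a list of onset indices. The function name `extract_trials` and the choice to silently skip onsets too close to the recording edges are assumptions made for the example; at 16 Hz, the window from −2 s to +3 s corresponds to 80 samples.

```python
import numpy as np


def extract_trials(eeg, onsets, fs=16, t_min=-2.0, t_max=3.0):
    """Cut trials from continuous EEG.

    eeg    : array (n_channels, n_samples), already pre-processed
    onsets : iterable of movement-onset sample indices (time 0 of each trial)
    Returns an array of shape (n_trials, n_channels, n_window_samples).
    """
    start_offset = int(round(t_min * fs))        # -32 samples at 16 Hz
    n_window = int(round((t_max - t_min) * fs))  # 80 samples at 16 Hz
    trials = []
    for onset in onsets:
        start = onset + start_offset
        stop = start + n_window
        # Assumption: onsets without a full window inside the recording are skipped.
        if start < 0 or stop > eeg.shape[1]:
            continue
        trials.append(eeg[:, start:stop])
    if not trials:
        return np.empty((0, eeg.shape[0], n_window))
    return np.stack(trials)
```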
In order to include only clean data in the datasets to analyse, we applied a well-established outlier rejection algorithm [17]-[19]. A single trial was kept in the dataset only if it met both of the following conditions: (1) its absolute amplitude did not exceed 125 µV, and (2) its kurtosis did not exceed four times its standard deviation.
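A minimal sketch of such a rejection rule is given below. The amplitude criterion follows the text directly; the kurtosis criterion is stated only briefly, so the interpretation used here (a trial is rejected when its kurtosis lies more than four standard deviations above the mean kurtosis across trials) is an assumption, as is the function name `clean_trial_mask`. The exact rule in [17]-[19] may differ.

```python
import numpy as np
from scipy.stats import kurtosis


def clean_trial_mask(trials, amp_uv=125.0, n_std=4.0):
    """Return a boolean mask (True = keep) for trials of shape
    (n_trials, n_channels, n_samples), expressed in microvolts."""
    flat = trials.reshape(trials.shape[0], -1)

    # Criterion 1: absolute amplitude never exceeds the threshold.
    amp_ok = np.abs(flat).max(axis=1) <= amp_uv

    # Criterion 2 (assumed interpretation): the kurtosis of the trial is not
    # more than n_std standard deviations above the across-trial mean.
    k = kurtosis(flat, axis=1)
    kurt_ok = (k - k.mean()) <= n_std * k.std()

    return amp_ok & kurt_ok
```

The mask can then be applied as `trials[mask]`, together with the corresponding class labels.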
Finally, we obtained two different 3-class datasets: dataset 1 includes clean data from Experiment 1, while dataset 2 includes those from Experiment 2. In both datasets, each trial can be described as follows:

$$X^{(i)} = \begin{bmatrix} x^{(i)}_{1,1} & \cdots & x^{(i)}_{1,N} \\ \vdots & \ddots & \vdots \\ x^{(i)}_{C,1} & \cdots & x^{(i)}_{C,N} \end{bmatrix} \in \mathbb{R}^{C \times N} \qquad (1)$$

where i is the trial index (over all classes of movements), C is the number of EEG channels, and N is the number of time samples available. Note that N varies depending on the learning model used to analyse the data (see Sections III-C and III-D). X^{(i)} can be interpreted as an EEG 2D image.
Moreover, the class of movements can be either touch, grasp, palmar, lateral or rest. Dataset 1 includes touch, grasp and rest classes, while dataset 2 includes palmar, lateral and rest classes.

C. Classification with CNN
The CNN is a particular type of neural network that implements, in at least one of its layers, a convolutional operation [20]. In this study, the architecture of the CNN was adapted from [21], [22]. As depicted in Fig. 1, it consisted of 7 layers. The first two were convolutional layers: the first one performed a temporal filtering (i.e., convolution along the time axis), while the second one performed a spatial filtering (i.e., convolution along the channel axis). Each convolutional layer was followed by a batch normalization layer and an exponential linear unit (ELU) activation function. Then, an average pooling layer, which flattened the input to a single dimension, and two fully connected layers were stacked on top of the convolutional ones. Finally, a softmax activation function returned the probability of each sample belonging to each class. Note that, since the kernel size of the second convolutional layer was equal to the number of channels, this filter reduced the channel dimension to one. The input to this CNN was given by the EEG 2D images X^{(i)}, for every available trial i, as defined in eq. (1), which resulted in a three-dimensional tensor.
To implement such an architecture, several parameters had to be decided: specifically, the kernel size and the depth of the convolutional layers, and the size of the pooling and dense layers. For each participant, we used a grid-search procedure to optimize these parameters over ranges selected a priori. Then, the optimal combination of parameter values was given by a majority vote strategy across all participants. As a result, the kernel size of layer 1 (i.e., the first convolutional layer) was equal to 30, while that of layer 3 (i.e., the second convolutional layer) corresponded to the number of channels, i.e., 58. Moreover, for both of them the optimal depth was found to be 40 filters. The kernel size of the average pooling layer was equal to 15, the first fully connected layer had 80 neurons, while the second fully connected layer
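To make the reported hyperparameters more concrete, the following sketch traces the tensor shape of a single trial through the layers described above. It is only shape arithmetic, not a trained model. It assumes an input of 80 time samples (the full 5 s segment at 16 Hz), convolutions without padding and with unit stride, and non-overlapping average pooling; none of these details is stated explicitly in the text. The second fully connected layer is left out because its size is not available here.

```python
def cnn_layer_shapes(n_channels=58, n_samples=80, k_time=30,
                     depth=40, k_pool=15, n_fc1=80):
    """Return the (assumed) output shape of each stage for one input trial.

    Shapes are given as (feature maps, channels, time) until flattening.
    """
    shapes = {"input": (1, n_channels, n_samples)}

    # Temporal convolution: kernel of k_time samples along the time axis
    # (assumed: no padding, stride 1).
    t_conv = n_samples - k_time + 1
    shapes["temporal_conv"] = (depth, n_channels, t_conv)

    # Spatial convolution: kernel spanning all channels, so the channel
    # dimension collapses to one.
    shapes["spatial_conv"] = (depth, 1, t_conv)

    # Average pooling along time (assumed: non-overlapping windows).
    t_pool = t_conv // k_pool
    shapes["avg_pool"] = (depth, 1, t_pool)

    # Flattening, followed by the first fully connected layer.
    shapes["flatten"] = (depth * t_pool,)
    shapes["fc1"] = (n_fc1,)
    return shapes
```

Under these assumptions, the temporal convolution leaves 51 time samples, the pooling reduces them to 3, and the first fully connected layer receives 120 features.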

D. Classification with baseline models
Two state-of-the-art machine learning models were used as a comparison for our proposed CNN: an sLDA and an RF. Both are simple to implement and computationally light, and they have shown good performance in EEG classification during hand movements, gesture recognition and BCI experiments.
The LDA is a supervised multi-class classification technique which aims at estimating the parameters of the linear multivariate model of the input data, via a parametric density estimation procedure [8]. Here, the input to the sLDA is the vector x obtained by reshaping the matrix X as follows:

$$x^{(i)} = \mathrm{vec}\left(X^{(i)}\right) \in \mathbb{R}^{C N} \qquad (2)$$

where i is the trial number, C the number of EEG channels, and N the number of time samples available in the sliding window. The shrinkage version of the LDA, i.e., the sLDA, introduces a regularization strategy that is especially useful with high-dimensional feature spaces when only a few data points are available. For the regularization, we considered the pooled covariance matrix, computed from the 3 classes, and we optimized the regularization parameter as in [23]. A common approach to obtain the optimal sLDA model with time series, as in the EEG case, is to train several sLDA models, each one based on a different subset of the training set (e.g., given by a different observation window), and to select the one which yields the best training performance. Thus, here, for each single trial i, a sliding window was used to scan the entire EEG segment from −2 s to 3 s. Then, an sLDA model was obtained at every second time instant (i.e., one every 125 ms). For each participant, the time instant where the sLDA model resulted in the best classification performance was taken as the trained model. Moreover, three different window lengths were tested for each participant, specifically {0.6, 0.8, 1} s, and the same model training was repeated for every length value.
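The sliding-window scan can be sketched as follows: for a given window length, a window is placed at every second sample (125 ms at 16 Hz), and each windowed trial is reshaped into a feature vector. The function name `window_features`, the row-wise flattening order, and the rounding of window lengths that are not an integer number of samples (e.g., 0.6 s at 16 Hz) are assumptions made for this illustration.

```python
import numpy as np


def window_features(trials, win_s, fs=16, step=2):
    """Build one feature matrix per window position.

    trials : array (n_trials, n_channels, n_samples)
    win_s  : window length in seconds (e.g., 0.6, 0.8 or 1.0)
    Returns a dict mapping the window start index (in samples) to an array
    of shape (n_trials, n_channels * n_window_samples).
    """
    n_trials, n_channels, n_samples = trials.shape
    n_win = int(round(win_s * fs))  # assumption: round to the nearest sample
    features = {}
    for start in range(0, n_samples - n_win + 1, step):
        segment = trials[:, :, start:start + n_win]
        # Flatten each windowed trial into a single vector.
        features[start] = segment.reshape(n_trials, n_channels * n_win)
    return features
```

Each feature matrix could then be used to fit one classifier per window position, for instance scikit-learn's `LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")`, keeping the position with the best performance. Note that the automatic shrinkage in that library is not necessarily identical to the procedure of [23].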
The RF is a classifier that works as an ensemble of individual decision tree algorithms to reduce the risk of overfitting and, thus, to enhance the classification performance. Each tree is obtained by independently bootstrapping the samples from the input dataset, resulting in uncorrelated models whose predictions are more accurate than the ones we would obtain from a single one [24]. Then, a random set of predictors is used at each split to grow the tree [25]. To compute the predictions, a majority vote across the predictions of the individual decision trees is used. In this study, the vector in eq. (2) was also used as the input to the RF. The number of trees was empirically set to 50, found as the best trade-off between the classification accuracy and the computational complexity.

E. Cross-validation and performance evaluation
The performance of the classifiers was evaluated by means of the accuracy, computed as follows:

$$\text{accuracy} = \frac{\text{correctly classified instances}}{\text{total number of instances to classify}}.$$
The chance level was computed for each model and each participant by means of the Adjusted Wald Interval [26], with α set to 0.05. We split each dataset into a training set (75%) and a validation set (25%). During training, a 10-times repeated 5-fold cross-validation procedure was adopted to ensure the robustness of the trained model. The validation set was used for testing the performance of the trained models on unseen data. All splits produced representative subsets of the dataset, so as to have balanced classes for an unbiased classification.
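The two quantities used in this evaluation can be sketched as below. The accuracy follows the definition given above. For the chance level, the sketch assumes the adjusted Wald formulation commonly used in the BCI literature, in which two successes and two failures are added to the theoretical chance rate before computing the upper confidence limit; whether [26] uses exactly this variant is an assumption, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def accuracy(y_true, y_pred):
    """Fraction of correctly classified instances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def chance_upper_bound(n_trials, n_classes, alpha=0.05):
    """Upper limit of the adjusted Wald interval around the chance rate.

    Assumed formulation: p_adj = (n * p0 + 2) / (n + 4), with p0 = 1 / n_classes,
    and upper limit p_adj + z * sqrt(p_adj * (1 - p_adj) / (n + 4)).
    """
    p0 = 1.0 / n_classes
    p_adj = (n_trials * p0 + 2.0) / (n_trials + 4.0)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    return p_adj + z * sqrt(p_adj * (1.0 - p_adj) / (n_trials + 4.0))
```

For example, with a hypothetical 240 trials and 3 classes, `chance_upper_bound(240, 3)` gives about 0.395, i.e., an accuracy has to exceed roughly 0.40 rather than 0.33 to be considered above chance. The bound shrinks towards 1/3 as the number of trials grows.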

IV. RESULTS AND DISCUSSION
In this section, we describe both the quality of our dataset after pre-processing and the results of the classification using the CNN model designed in Section III-C, including the comparison with sLDA and RF.

A. Pre-processing, feature extraction and MRCPs
As a result of the pre-processing (see Section III-B), 3 out of 11 participants (namely, S002, S003, S005) were excluded from further analysis of dataset 1, due to the massive presence of artifacts in their EEG recordings. The quality of the clean EEG data after pre-processing is shown in Fig. 2, which reports a subset of EEG segments, after synchronization at the movement onset, for the different movement classes and for the rest periods, in both datasets. In Fig. 2, we can notice that, for every movement, negative values are seen around time zero, i.e., the movement onset, which represent the negative peak of the MRCPs. Moreover, all panels show good repeatability across movement repetitions (i.e., segments). Conversely, as expected, we cannot notice any clear pattern across the rest segments. We also observed (results not reported for space constraints) that a difference in the MRCP peak amplitude was especially noticeable at the EEG electrodes located on the side contralateral to the movement, and that this spatial pattern is consistent across several participants, in line with other literature [27]. However, it is also clear that Dataset 1 is more affected by noise compared to Dataset 2, so that, e.g., the touch-related EEG data could show a less pronounced negative peak of the MRCPs (as seen in Fig. 2a). We also observed that this behaviour is consistent across most of the channels, with no specific spatial pattern (results not reported for space constraints).

B. Classification results
Tab. I and Tab. II report the comparison of the classification performance between the CNN and the baseline models over the unseen validation set of the two datasets. They show the results of the classification in terms of accuracy. To achieve this performance, we used the CNN model with the best selection of hyperparameters, employing the same architecture for all participants. On the other hand, for sLDA and RF we considered all possible choices of the sliding window length, with the best window time location, for each participant. The chance level was computed as in Section III-E and it was found to be 0.40. Comparing the classification results among the three classifiers, we can see similar accuracies for all of them, with all values above the chance level. We can also notice that they achieved slightly better results in Dataset 1, as expected from its higher repeatability across EEG segments (see Fig. 2), compared to Dataset 2. However, for both datasets, the CNN model reached the best average accuracy across all participants (0.70 for Dataset 1, 0.64 for Dataset 2). sLDA and RF achieved the best classification accuracy at the single-subject level in Dataset 1: thus, a particular configuration (i.e., an optimal choice of the window length and time location) can lead a baseline model to yield higher performance compared to the CNN. Nevertheless, especially for Dataset 2, the CNN showed higher variability in the individual participant accuracies, with some of them reaching very high values (0.80 for G12) and others slightly above the chance level (0.43 for G02). Finally, from the confusion matrices (not reported here for space constraints), we observed that the rest class was classified with the highest accuracy compared to the movement classes (best accuracy among the two datasets: 78% for rest, 57% for touch, 62% for grasp, 55% for palmar, 52% for lateral), in line with previous literature [9]-[11].
As expected, the computational complexity of sLDA and RF is significantly lower compared to the CNN: the former use fewer time points as input to train the model (either 0.6 s, 0.8 s or 1 s) and need a shorter training time, while the latter took the entire 5 s EEG segments into account and needed a longer time to train. However, the CNN showed promising advantages over sLDA and RF: indeed, to reach comparable performance, the baseline models relied on a semi-quantitative pre-processing pipeline, including ICA to clean the data from eye movement artifacts. Moreover, they had to train a classifier at each time point to select the one that led to the best performance. On the other hand, a lighter pre-processing is needed to classify the datasets by means of the CNN, and it is completely automatic. Even though only two relatively small datasets were available, we could show that our CNN model can achieve classification accuracies in line with two well-established baseline models. Moreover, we obtained similar performance with a simpler pre-processing pipeline, reducing it to those steps (e.g., filtering and automatic trial rejection) that could be performed in an online modality. This may be explained by the fact that the CNN can behave both as an automatic feature extraction method and as an efficient classifier. Finally, the CNN could take greater advantage of the spatial information in the EEG dataset, by applying a spatial convolution at its second layer. On the other hand, sLDA and RF did not use this kind of information to enhance their predictions.

V. CONCLUSIONS
In this study, we considered two different datasets where fine hand movements (touch, grasp, palmar and lateral grasp) were repeated in a self-paced modality, and we evaluated the classification performance of a deep learning model, i.e., a CNN, with respect to two well-established machine learning models, i.e., sLDA and RF. The classification included three classes, i.e., two movements and the rest condition, and it was based on the components of the EEG signals in the 0.3-3 Hz low-frequency band. This is the typical band used to detect the MRCPs. We showed that the CNN achieved good performance in both datasets (average accuracy of 0.70 in dataset 1 and 0.64 in dataset 2, with a chance level of 0.40), similar or superior to that of the baseline models. All classifiers yielded better results in the first dataset (touch, grasp and rest), in line with the electrophysiological observations on the MRCPs, which were more pronounced in that dataset. Moreover, for similar reasons, the rest condition always led to the highest true positive rate. We also highlighted that our CNN did not require the use of ICA, which is a common but computationally heavy and semi-quantitative pre-processing step, paving the way for its possible use in an online modality, e.g., in many BCI applications.

Fig. 1: Schematic representation of the proposed CNN model architecture.

TABLE I: Comparison of classification performance (in terms of accuracy) on the validation set of Dataset 1. Different window lengths were tested for sLDA and RF.

TABLE II: Comparison of classification performance (in terms of accuracy) on the validation set of Dataset 2. Different window lengths were tested for sLDA and RF.