Evaluation of Machine Learning Algorithms for Classification of EEG Signals

In brain–computer interfaces (BCIs), it is crucial to process brain signals to improve the accuracy of the classification of motor movements. Machine learning (ML) algorithms such as artificial neural networks (ANNs), linear discriminant analysis (LDA), decision tree (D.T.), K-nearest neighbor (KNN), naive Bayes (N.B.), and support vector machine (SVM) have made significant progress in classification problems. This paper presents a signal processing analysis of electroencephalographic (EEG) signals in which different feature extraction techniques are used to train selected classification algorithms to classify signals related to motor movements. The motor movements considered are related to the left hand, right hand, both fists, feet, and relaxation, making this a multiclass problem. In this study, nine ML algorithms were trained with a dataset created by the feature extraction of EEG signals. The EEG signals of 30 Physionet subjects were used to create a dataset related to movement. We used electrodes C3, C1, CZ, C2, and C4 according to the standard 10-10 placement. Then, we extracted the epochs of the EEG signals and applied tone, amplitude and level, and statistical measurement techniques to obtain the set of features. Custom applications developed in LabVIEW™ 2015 were used for reading the EEG signals; for channel selection, noise filtering, band selection, and feature extraction operations; and for creating the dataset. MATLAB 2021a was used for training, testing, and evaluating the performance metrics of the ML algorithms. In this study, the Medium-ANN model achieved the best performance, with an AUC average of 0.9998, Cohen’s Kappa coefficient of 0.9552, a Matthews correlation coefficient of 0.9819, and a loss of 0.0147. These findings suggest the applicability of our approach to different scenarios, such as implementing robotic prostheses, where the use of superficial features is an acceptable option when resources are limited, as in embedded systems or edge computing devices.


Introduction
The central nervous system is composed of the spinal cord and the brain; the human brain resides in the skull and is considered an essential part of the central nervous for classification algorithms. The next step is classification, which is carried out by different algorithms, including LDA, SVM [31], KNN [32], D.T. [33], N.B., and ANN [34].
Currently, there are different fields of science, engineering, and research that evaluate and make use of BCIs to develop applications that present solutions to complex problems [35,36]. These have been possible due to advances in high-density electronics, data acquisition systems that allow high-quality EEG signals to be acquired, intelligent systems that use machine and deep learning algorithms, and neural networks that allow pattern recognition and signal classification to be performed with high precision. In [25], the authors explain that BCIs can be used in the following six application scenarios: replace, restore, augment, enhance, supplement, and research tools. The authors of [37] commented that current and future BCI application areas are device control, user status monitoring, assessment, training and education, gaming and entertainment, cognitive enhancement, safety, and security. Intelligent systems commonly incorporate machine learning (ML) approaches [38][39][40]. ML refers to a system able to learn from training data from certain activities so that the analytical model generation process is automated, and associated tasks can be completed or supplemented [41,42]. Deep learning (DL) is a paradigm within ML based on the use of artificial neural networks (ANNs) [41]. Commonly, ML algorithms focus on classifying EEG signals related to the motor and imaginary movements of hands and feet to carry out control actions, as presented in [43][44][45][46]. DL is useful in areas with vast and high-dimensional data; therefore, deep neural networks outperform ML algorithms for most text, images, video, voice, and audio processing techniques [47]. Nevertheless, for low-dimensional data input, especially with insufficient training data, ML algorithms may still achieve superior results [48], which are even more interpretable than deep neural network results [49]. 
The authors of [50] used power, mean, and energy as features to classify EEG signals related to the right and left hands through artificial neural networks (ANNs) and support vector machine (SVM). In [51], the authors used SVM to control the direction of a wheelchair by extracting the mean, energy, maximum value, minimum value, and dominant frequency characteristics of the EEG signals. In [52], the authors used the fast Fourier transform and principal component analysis to obtain characteristics of the EEG signals, which were fed to an SVM classifier to control a robotic arm. The authors of [53] reported the use of EEG signals to control an exoskeleton and the use of SVM, LDA, and NN for their respective classification. Studies such as the one presented in [54] have used pretrained neural network models to classify EEG signals through time-frequency characteristics. Recent studies have focused on the proper selection of EEG signal characteristics and its effect on the accuracy of ML and DL algorithms, as presented in [30]. ML and DL techniques are widely accepted and help to develop specific tasks within different applications [55][56][57][58][59][60][61]. Moreover, they are increasingly used to obtain EEG data for pattern analysis, classification of group membership, and BCIs [29,62-67]. However, there are still open research problems, such as the real-time processing of EEG signal classification and the optimization of ML algorithms for implementation on embedded systems or edge computing devices. Hence, research on and development of reliable, efficient, and robust systems for EEG signal classification, among others, should be pursued [16,68]. Human movements for the manipulation of tools are highly complex and diverse. For an adult human brain that has automated these movements, they do not represent a major effort; for ML, however, they require precisely managed information inputs that allow free movement to be programmed and executed.
Previous studies offer multiple classes of motor imagery limb movements based on EEG spectral and time domain descriptors [69]; in this sense, there continues to be a need in machine learning to increase the reliability and accuracy of EEG signals used for programming human-like movements.
For the reasons stated above, the aim of this paper is to evaluate nine ML algorithms for the classification of EEG signals. The purpose is to find which ML model presents the best performance metrics for the identification of movement patterns in EEG signals for the control of a mechatronic system, in this case, a robotic hand prosthesis. The selected dataset consists of more than 1500 EEG recordings of 1-2 min in length from 109 subjects and is publicly available in [70]. In this study, we randomly selected 30 subjects to train, validate, and test the proposed method. The ultimate aim is to facilitate the development of robotic limb prosthetics, which is possible because ML algorithms can recognize patterns in EEG signals with complex dynamics. The hypothesis is that ML algorithms perform better in tasks of signal classification than standard methods. The novelty of this study is to provide a methodology for the classification of EEG signals by training several ML algorithms and employing processing, analysis, and feature extraction techniques in the time domain of various lapses of EEG signals related to motor tasks, which can be translated into commands for the control of mechanisms or mechatronic systems such as wheelchairs, robotic prostheses, and mobile robots.
The rest of this paper is organized as follows: Section 2 presents the materials used for the development of the proposed method; additionally, the description of the dataset used for this paper is presented. Section 3 presents the performance metrics obtained from the proposed ML models and the discussion of the main findings obtained in this study. Section 5 presents the proposed usage scenario in a real-world application. Conclusions and future work are described in Section 6.

Hardware and Software
The hardware used for the implementation of the proposed method had the following specifications: Microsoft Windows 10 Pro operating system, system model OptiPlex 3070, system type x64-based PC, Intel Core i5-9500 processor at 3.00 GHz with six cores and six logical processors, 16.0 GB of DDR4 2666 MHz RAM (2 × 8 GB), and an NVIDIA GeForce GT 1030 graphics card with 2 GB of GDDR5 memory (PCI-Express ×16). The software used for reading the EEG signals, electrode selection, signal segmentation, preprocessing, analysis, feature extraction, and preparation of the dataset was LabVIEW 2015. Furthermore, the following libraries, which are part of the LabVIEW development environment, were used: Biomedical Toolkit and SignalExpress. MATLAB 2021a was used for training and testing the different ML algorithms, which are part of the Statistics and Machine Learning Toolbox and the Deep Learning Toolbox.

Machine Learning Algorithm Training
In this paper, we selected nine ML algorithms to evaluate their performance in the classification of EEG signals related to the motor movements of the right hand, left hand, both fists, and feet, and to relaxation. The nine selected algorithms are naive Bayes (N.B.), k-nearest neighbors (KNN), decision tree (D.T.), support vector machine (SVM), linear discriminant analysis (LDA), Narrow-ANN, Medium-ANN, Wide-ANN, and Bilayered-ANN. These ML algorithms are part of the Statistics and Machine Learning Toolbox of MATLAB, which has various tools that can be used for both the pre- and post-processing of data. Figure 1 shows the block diagram used to train, test, and evaluate the selected ML algorithms. First, the dataset is loaded; the chosen dataset comprises more than 1500 EEG recordings, each between 1 and 2 min long, from 109 subjects and can be found in [70]. In this study, 30 subjects were randomly chosen to train, test, and validate the proposed method. Subsequently, the data are normalized between 0 and 1 to obtain better results. Next, we randomly split the dataset into 80% for training and 20% for testing. Then, the ML model is trained. The final step is to obtain the metrics used to evaluate the ML models (for example, from the confusion matrix), such as the area under the curve (AUC) and accuracy, among others. A typical system for EEG signal classification is conceptually divided into signal acquisition, preprocessing, feature extraction, and classification [23,71]. The EEG signals are acquired by electrodes located on the scalp's surface that transfer information on the electrical neuronal activity to the data acquisition system. In preprocessing, line noise and muscle artifacts are removed from the EEG signals. Feature extraction uses several digital signal processing techniques to obtain feature vectors. These vectors are used to train the ML or DL algorithms to classify the EEG signals.
The result of the algorithms is a specific class, as illustrated in Figure 2. The following subsections describe the procedure in detail.
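The normalization and splitting steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the MATLAB workflow used in the study: the function names, the toy data, and the fixed random seed are assumptions made for the example. The order follows the text (normalize, then split); a common alternative is to compute the scaling limits on the training portion only.

```python
import random

def min_max_normalize(rows):
    """Scale every feature column to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0   # constant column -> 0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle the samples and hold out test_fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)

# Toy data (hypothetical): 10 samples, 3 features, 5 class labels.
features = [[float(i), float(10 - i), float(i * i)] for i in range(10)]
labels = [i % 5 for i in range(10)]

scaled = min_max_normalize(features)
(train_x, train_y), (test_x, test_y) = train_test_split(scaled, labels)
```

The training and evaluation steps are omitted here because they depend on the classifier and toolbox in use.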

Feature Vector
Feature extraction using tone, amplitude, and statistical measurements in different frequency bands

Input Data
The dataset used for EEG signal classification was developed by Schalk and colleagues at the Nervous System Disorders Laboratory and is publicly available on Physionet [70]. The data consist of more than 1500 EEG recordings of 1-2 min in length from 109 subjects. Subjects performed 14 tasks (experiments) while 64 electrodes acquired and recorded the EEG signals through the BCI2000 system [72]. The data are in EDF+ format [73], and they contain 64 EEG signals, each sampled at 160 samples per second, and an annotation channel, which refers to the actions performed during the task. Table 1 shows the experimental protocol of Schalk and colleagues. The electrodes used to record the data follow the standard 10-10 placement. The dataset consists of 109 folders, and each folder contains 28 files, where 14 of these have the *.edf extension, and the other 14 have the *.edf.event extension. The files with the *.edf extension contain the EEG signals. The *.edf.event files describe the events that occurred during the different tasks. Although the original set of recorded data consists of continuous multichannel data from a large number of subjects, we only used the EEG signals of 30 randomly selected subjects and the tasks that involve real movements, namely tasks 3, 5, 7, 9, 11, and 13. In tasks 3, 7, and 11, real movements related to the right and left fists and relaxation are carried out, while in tasks 5, 9, and 13, real movements of both fists and both feet are carried out. Table 1 summarizes the dataset used in the proposed approach.

EEG Signal Acquisition and Channel Selection
LabVIEW 2015 was employed as the development platform, and the Biomedical Toolkit was used to import the EEG signals because they are stored in EDF format. The selected electrodes are shown in Figure 4b. These electrodes present neuronal activity correlated to the execution of the left- and right-hand movements (contained in electrodes C3, C4, and CZ [74,75]) and the neuronal activity related to the movement of both feet (contained in electrodes C1 and C2 [76]). Because the different EEG channels tend to represent redundant information, as mentioned in [77], only electrodes C3, C1, CZ, C2, and C4 were selected in our study. The selected electrodes are located around the center of the scalp, over the motor cortex area; they are also among the electrodes least affected by artifacts [78], which allows features to be extracted reliably.

Preprocessing
The EEG signals used, sampled at 160 Hz, are available online [70]. Bandpass filtering was required to keep only the frequencies of interest and to eliminate line noise and other interference. For this study, we processed the EEG signals through a third-order Butterworth IIR bandpass filter with a passband from 0.1 to 50 Hz. After this, a 50 Hz notch filter was applied to the signals to eliminate power-line noise. Figure 5 shows the signals of the selected electrodes before and after applying the filters of the preprocessing stage.
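The filter chain described above can be sketched with SciPy. This is a hedged illustration, not the LabVIEW implementation used in the study: the notch quality factor (`Q=30`), the use of zero-phase filtering (`sosfiltfilt`/`filtfilt`), and the second-order-sections form are assumptions chosen for the example. Only the filter order, the 0.1 to 50 Hz passband, the 50 Hz notch, and the 160 Hz sampling rate come from the text.

```python
import numpy as np
from scipy import signal

FS = 160.0  # sampling rate of the recordings, in Hz

def preprocess(eeg, fs=FS):
    """Third-order Butterworth band-pass (0.1-50 Hz) followed by a 50 Hz notch."""
    # Second-order sections are numerically safer for a very low cutoff.
    sos = signal.butter(3, [0.1, 50.0], btype="bandpass", fs=fs, output="sos")
    band_limited = signal.sosfiltfilt(sos, eeg)
    # Q = 30 is an assumed quality factor; the text does not specify one.
    b, a = signal.iirnotch(50.0, Q=30.0, fs=fs)
    return signal.filtfilt(b, a, band_limited)

# Synthetic check signal (hypothetical): 10 Hz rhythm plus 50 Hz interference.
t = np.arange(1600) / FS                      # 10 s of samples
raw = np.sin(2 * np.pi * 10.0 * t) + np.sin(2 * np.pi * 50.0 * t)
clean = preprocess(raw)
```

Zero-phase filtering avoids shifting the signal in time, which suits offline analysis; a real-time system would need a causal filter instead.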

Feature Extraction
The features of the EEG rhythm can be obtained by using several digital signal processing techniques. These features were used for training the nine ML algorithms. The analysis techniques included measurements of tone, amplitude, and level, as well as statistical analyses. Table 3 shows the type of measurements and features obtained when these techniques were applied to the EEG signal epochs.
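As an illustration of how a per-epoch feature vector might be assembled, the sketch below computes three measurements for each of the five electrodes, giving 15 values. The specific choice of dominant frequency (a tone measurement), peak-to-peak amplitude, and standard deviation is an assumption made for this example; the actual features are those listed in Table 3, and the function names here are hypothetical.

```python
import numpy as np

FS = 160.0                                   # sampling rate, in Hz
CHANNELS = ("C3", "C1", "CZ", "C2", "C4")    # electrode order within an epoch

def channel_features(x, fs=FS):
    """Return [dominant frequency, peak-to-peak amplitude, standard deviation]."""
    x = np.asarray(x, dtype=float)
    spectrum = np.abs(np.fft.rfft(x - x.mean()))   # remove DC before the FFT
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    dominant_freq = float(freqs[np.argmax(spectrum)])
    peak_to_peak = float(x.max() - x.min())
    std_dev = float(x.std(ddof=1))
    return [dominant_freq, peak_to_peak, std_dev]

def epoch_to_feature_vector(epoch):
    """epoch: array of shape (5, n_samples), rows ordered as CHANNELS."""
    return [value for row in epoch for value in channel_features(row)]

# Synthetic 2 s epoch (hypothetical): a 10 Hz sine with a different
# amplitude on each electrode.
t = np.arange(320) / FS
epoch = np.stack([(k + 1) * np.sin(2 * np.pi * 10.0 * t) for k in range(5)])
vector = epoch_to_feature_vector(epoch)
```

Each vector would then be stored together with its class label to form one row of the dataset.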

Dataset Preparation
The data vectors consist of 15 features, 3 features for each electrode; the electrodes correspond to positions C3, C1, CZ, C2, and C4, which are related to motor movements, and each vector belongs to one of the five classes: "Relaxation", "Right hand", "Left hand", "Both fists", and "Feet". The dataset has 2792 samples, where 558 samples correspond to the "Relaxation" class, 567 to the right hand, 555 to the left hand, 561 to both fists, and 547 to the feet. On average, there are 557 samples per class, so the classes are balanced. Figure A1 in Appendix A shows a fragment of the dataset created by processing EEG signals when different users performed different motor tasks. Figure A2 in Appendix B depicts the graphical user interface (GUI) of the software (App) developed for the feature extraction process. The proposed App allows features to be extracted in different frequency bands, where each frequency band corresponds to a different class. Table 3 shows the features obtained for training the different ML algorithms. Each row represents a feature vector covering the five electrodes. Three different measurements were made for each electrode, which resulted in a vector with 15 different characteristics used for the training and testing of the ML and DL models. Each feature vector is labeled with its respective class. For each of the five classes, 15 different features were obtained in five different frequency bands to improve the classification accuracy of the ML algorithms [81,82]. The proposed dataset can be downloaded from the link in the Supplementary Materials.

Results
To evaluate the performance of the ML algorithms, we used the following scoring metrics: Accuracy, Error, Recall, Specificity, Precision, and F1-Score. The performance evaluation of the proposed ML models was initiated by calculating Sensitivity, Specificity, Precision, and Accuracy [83,84]. Sensitivity, also known as Recall [84], measures the proportion of positives that are correctly identified as such; it can be calculated by (9). Similarly, Specificity measures the proportion of negatives that are correctly identified as such [84]; it can be calculated by (10). Precision is the proportion of true positives among the positive predictions [84]; it can be calculated by (11). Accuracy can be calculated using (12):

$$\mathrm{Recall} = \frac{\mathit{TruePositives}}{\mathit{TruePositives} + \mathit{FalseNegatives}}, \tag{9}$$

$$\mathrm{Specificity} = \frac{\mathit{TrueNegatives}}{\mathit{FalsePositives} + \mathit{TrueNegatives}}, \tag{10}$$

$$\mathrm{Precision} = \frac{\mathit{TruePositives}}{\mathit{TruePositives} + \mathit{FalsePositives}}, \tag{11}$$

$$\mathrm{Accuracy} = \frac{\mathit{TruePositives} + \mathit{TrueNegatives}}{\mathit{TruePositives} + \mathit{FalsePositives} + \mathit{TrueNegatives} + \mathit{FalseNegatives}}. \tag{12}$$
F1-Score is a method for combining Precision and Recall into a single measure that includes both [85]. Neither Precision nor Recall can describe the complete situation on its own. We might have outstanding Precision but poor Recall, or vice versa, poor Precision but good Recall. With F1-Score, one can represent both concerns with a single score [86]. Once Precision and Recall for a binary or multiclass classification task have been computed, the two scores may be combined to calculate the F1-Score metric; it can be calculated by (13):

$$\mathrm{F1\text{-}Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{13}$$

Equations (9)-(13) are valid for both binary and multiclass classification problems; however, when used for multiclass problems, they must be calculated for each class and then averaged to obtain each metric per model. Table 4 shows the average scores obtained for each performance metric by the nine ML algorithms selected in this study. The first parameter analyzed was accuracy, where the LDA model presented an accuracy score of 0.9229; D.T. obtained 0.9803; KNN obtained 0.8996; N.B. obtained 0.9373; SVM obtained 0.9803; Narrow-ANN, Medium-ANN, and Bilayered-ANN obtained 0.9857; finally, Wide-ANN obtained 0.9821. The Narrow-ANN, Medium-ANN, and Bilayered-ANN models obtained the best accuracy score (0.9857). Regarding the error metric, the LDA, D.T., N.B., SVM, Narrow-ANN, Medium-ANN, Wide-ANN, and Bilayered-ANN algorithms achieved a score less than 0.1, while the KNN model obtained an error greater than 0.1; the models with the lowest error were Narrow-ANN, Medium-ANN, and Bilayered-ANN (0.0143). Considering the recall parameter, we observed that the Narrow-ANN algorithm presented the highest score of 0.9863, while the KNN algorithm obtained the lowest score of 0.9037. Regarding the specificity metric, all the algorithms achieved a score greater than 0.9; the ML models with the best results were the Narrow-ANN, Medium-ANN, and Bilayered-ANN models, all scoring 0.9964.
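The per-class computation and averaging described above can be sketched as follows. This is a minimal example in plain Python, assuming a confusion matrix whose rows are true classes and whose columns are predicted classes; the function names and the 3-class toy matrix are hypothetical and are not taken from Table 4.

```python
def one_vs_rest_counts(cm, k):
    """Return (TP, FP, TN, FN) for class k of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                    # true class k, predicted otherwise
    fp = sum(row[k] for row in cm) - tp     # predicted k, true class differs
    tn = total - tp - fn - fp
    return tp, fp, tn, fn

def macro_metrics(cm):
    """Compute each metric per class, then average over the classes."""
    n_classes = len(cm)
    totals = {"accuracy": 0.0, "recall": 0.0, "specificity": 0.0,
              "precision": 0.0, "f1": 0.0}
    for k in range(n_classes):
        tp, fp, tn, fn = one_vs_rest_counts(cm, k)
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        totals["accuracy"] += (tp + tn) / (tp + fp + tn + fn)
        totals["recall"] += recall
        totals["specificity"] += tn / (fp + tn)
        totals["precision"] += precision
        totals["f1"] += 2 * precision * recall / (precision + recall)
    return {name: value / n_classes for name, value in totals.items()}

# Toy 3-class confusion matrix (hypothetical values).
toy_cm = [[8, 1, 1],
          [2, 7, 1],
          [0, 1, 9]]
scores = macro_metrics(toy_cm)
```

The sketch assumes every class is both present and predicted at least once; otherwise the divisions for Recall or Precision would need a guard against a zero denominator.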
Regarding the precision metric, the Bilayered-ANN algorithm presented the best result, with 0.9859, while the KNN algorithm presented the lowest score, with 0.9099. Regarding the F1-score parameter, the LDA, D.T., N.B., SVM, Narrow-ANN, Medium-ANN, Wide-ANN, and Bilayered-ANN algorithms achieved scores greater than 0.91, while the KNN model obtained a score below 0.91. The algorithm that presented the best F1-score result was Narrow-ANN, with 0.9859. Table 5 presents further performance metrics achieved by each ML algorithm: the average area under the curve (AUC average), Cohen's Kappa coefficient [87], the Matthews correlation coefficient [88], and model loss. Concerning the AUC average metric, all algorithms achieved a score greater than 0.90; the top three ML models, with the highest AUC scores, were SVM, Medium-ANN, and Bilayered-ANN. Regarding Cohen's Kappa coefficient, a score above 0.8 indicates excellent agreement, while zero or less indicates poor agreement. The LDA and KNN algorithms obtained Cohen's Kappa coefficients less than 0.80 but greater than zero, while the D.T., N.B., SVM, Narrow-ANN, Medium-ANN, Wide-ANN, and Bilayered-ANN algorithms achieved Cohen's Kappa coefficients of 0.9384, 0.8040, 0.9384, 0.9552, 0.9552, 0.9440, and 0.9552, respectively, where the Narrow-ANN, Medium-ANN, and Bilayered-ANN algorithms achieved the highest scores. In addition, we used the Matthews correlation coefficient, which has been widely used as a performance metric for ML algorithms since 2000. The best scores were obtained by the D.T., N.B., SVM, Narrow-ANN, Medium-ANN, Wide-ANN, and Bilayered-ANN models (0.9736, 0.9225, 0.9757, 0.9824, 0.9819, 0.9783, and 0.9820, respectively), with Narrow-ANN obtaining the best score, while the KNN algorithm achieved the lowest score of 0.8810.
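For reference, both agreement coefficients can be computed directly from a multiclass confusion matrix. The sketch below uses the standard multiclass definitions of Cohen's Kappa and the Matthews correlation coefficient; this is an assumption, since the text does not state how the values in Table 5 were computed (they may, for instance, have been averaged per class). The function names and the toy matrix are hypothetical.

```python
import math

def cohen_kappa(cm):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    total = sum(sum(row) for row in cm)
    true_totals = [sum(row) for row in cm]                       # row sums
    pred_totals = [sum(row[k] for row in cm) for k in range(len(cm))]
    observed = sum(cm[k][k] for k in range(len(cm))) / total
    expected = sum(t * p for t, p in zip(true_totals, pred_totals)) / total ** 2
    return (observed - expected) / (1.0 - expected)

def matthews_corrcoef(cm):
    """Multiclass Matthews correlation coefficient."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    true_totals = [sum(row) for row in cm]
    pred_totals = [sum(row[k] for row in cm) for k in range(len(cm))]
    numerator = correct * total - sum(
        t * p for t, p in zip(true_totals, pred_totals))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_totals))
        * (total ** 2 - sum(t * t for t in true_totals)))
    return numerator / denominator

# Toy 3-class confusion matrix (hypothetical values), rows = true classes.
toy_cm = [[8, 1, 1],
          [2, 7, 1],
          [0, 1, 9]]
kappa = cohen_kappa(toy_cm)
mcc = matthews_corrcoef(toy_cm)
```

Both coefficients equal 1 for a perfectly diagonal confusion matrix and drop toward 0 as the predictions approach chance level.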
The ML model with the lowest loss was Narrow-ANN, with 0.0136, followed by the Medium-ANN and Bilayered-ANN models, both with 0.0147, while the ML algorithm with the highest loss was KNN. Figure 6 shows the ROC curves of the top four ML algorithms trained for the classification of EEG signals related to the state of relaxation, right hand, left hand, both fists, and both feet. These algorithms are LDA, SVM, D.T., and N.B. The algorithm that presented the best performance metrics was SVM, with an AUC average of 0.9988. The ROC curves show the trade-off between sensitivity and specificity. The SVM curve was the closest to the upper-left corner of the ROC space, while the D.T. curve was closer to the 45-degree diagonal. ROC curves closer to the upper-left corner indicate better performance, while classifiers with ROC curves closer to the 45-degree diagonal of the ROC space are less accurate.

[Figure: ROC curves for EEG classification (1 vs. Others), with AUC values]
Figure 7 shows the ROC curves of the four DL algorithms (neural networks) trained for the classification of EEG signals. These algorithms are Narrow-ANN, Medium-ANN, Wide-ANN, and Bilayered-ANN. The algorithm that presented the best performance metrics was Medium-ANN, with an AUC average of 0.9998; it was the closest to the upper-left corner of the ROC space.
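A one-vs-others AUC average of the kind reported here can be sketched without plotting the curves, using the fact that the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. This is an illustrative plain-Python sketch; the function names, the class scores, and the labels are hypothetical, and the study itself used MATLAB for this step.

```python
def binary_auc(scores, is_positive):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_ovr_auc(score_rows, labels, n_classes):
    """One-vs-others AUC for each class, plus the unweighted average."""
    aucs = []
    for k in range(n_classes):
        class_scores = [row[k] for row in score_rows]
        aucs.append(binary_auc(class_scores, [y == k for y in labels]))
    return sum(aucs) / n_classes, aucs

# Hypothetical classifier scores for 6 samples and 3 classes.
labels = [0, 0, 1, 1, 2, 2]
score_rows = [
    [0.80, 0.10, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
    [0.50, 0.25, 0.25],   # a class-1 sample scored mostly as class 0
    [0.10, 0.20, 0.70],
    [0.20, 0.20, 0.60],
]
mean_auc, per_class_auc = average_ovr_auc(score_rows, labels, n_classes=3)
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for a dataset of a few thousand rows; a rank-based formulation would scale better.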
In machine learning, the model presumed to be best is chosen from a collection of model candidates obtained by evaluating various model types, hyperparameters, or feature subsets, among others. In this paper, we propose using ConfusionVis, a model-agnostic technique for evaluating and comparing multiclass classifiers based on their confusion matrices [56]. Figure 8 depicts the ConfusionVis results for the nine ML models chosen for EEG signal classification. Figure 8a shows the average accuracy score per ML model, where it can be observed that Narrow-ANN had the best accuracy score. Figure 8b illustrates the confusion matrix similarity results, where it can be seen that the D.T., SVM, Narrow-ANN, and Medium-ANN models obtained the best similarity. Figure 8c depicts the error by class scores, where it can be observed that Medium-ANN and Narrow-ANN achieved the lowest error score in most classes of movements classified from the EEG signals. Figure 8d shows the error by model scores, where it can also be seen that Medium-ANN obtained the lowest error score, followed by Bilayered-ANN, Narrow-ANN, and decision tree (D.T.). Figure 9 shows the training time of the nine ML algorithms tested, with N.B., LDA, and KNN having the shortest training time. However, the results in Tables 4 and 5 show that these algorithms had the lowest performance metrics, with the exception of D.T. In contrast, the SVM, Narrow-ANN, Medium-ANN, Wide-ANN, and Bilayered-ANN algorithms had the longest training times of 0.13546, 0.37135, 0.16956, 0.36255, and 0.45722 s, respectively, with the Bilayered-ANN algorithm having the longest training time. However, these algorithms had the best performance metrics, as shown in Tables 4 and 5 and Figures 7 and 8. Therefore, the data science engineer or researcher must perform a cost-benefit analysis regarding accuracy and processing time.
In most circumstances, engineers favor accuracy over training time, because training is only performed a few times and only the trained ML model is employed. For this reason, in this study, it is more convenient to select the Narrow-ANN model.

Discussion
In this study, we observed that the different features used were helpful for the classification of EEG signals, as proposed in our hypothesis. The presented features are based on the time domain: amplitude, frequency, phase, peak-peak value, negative peak, positive peak, median, mode, average, mean square error value, standard deviation, summation, variance, kurtosis, and skewness. We consider that they are good features for classifying EEG signals related to movements. Using these features, the ML model that achieved the best performance was Medium-ANN, with an average area under the curve of 0.9998, Cohen's Kappa coefficient of 0.9552, Matthews correlation coefficient of 0.9819, and loss of 0.0147.
We observed that the performance metrics obtained from the nine machine learning algorithms were good. Using standard features in different frequency bands and related to a particular class allowed machine learning and deep learning algorithms to obtain excellent performance metrics, as shown in Tables 4 and 5 and Figures 6-8; this is because the proposed frequency bands and features improved the separability of the data, making the classification algorithms substantially better.
Regardless, the data scientist or engineer is responsible for the corresponding cost-benefit analysis, weighing accuracy against processing time. In most cases, the more accurate ML models are chosen at the expense of training time, because training is performed once and only the trained model is used for the assigned task. The Medium-ANN algorithm was selected for this reason and because its performance metrics were the best. Feature extraction is therefore worth highlighting among the processes that improve the acquisition of relevant information and ensure better performance metrics when training EEG signal classification algorithms, as shown in different studies. Our results are consistent with other spectrogram methods implemented for identifying EEG patterns in persons with motor impairment using brain sources similar to those analyzed in this study [15]. Many fields of human behavior are still a challenge for BCIs; findings from this study may provide complementary data for other studies reporting findings on central nervous system damage with residual motor impairment of upper limb movement [18]. In addition to limb paralysis, limb loss represents an obstacle to quality of life, for which the results of this study offer a comprehensive and reliable technique for extracting electrical brain sources for human movement programming. As in other research [20], the results of the present study provide consistent and accurate information for future control inputs for the adaptation of prostheses. As reported elsewhere [69], we conclude that it is necessary to increase the number of movement classes in EEG feature extraction so that mechatronic systems controlled by means of BCIs are provided with suitable and reliable patterns corresponding to the target movements.

Proposed Usage Scenario
The ML algorithms proposed in this research study could be implemented in high-performance embedded systems or edge computing devices, as verified in previous studies [59,89]. Such a device would act as the central control system, which is in charge of communicating with the BCI to acquire EEG signals. Likewise, the control system is in charge of carrying out the digital processing of the EEG signals, the extraction of features, the classification, and the translation (decoding and execution) of the control commands. The mechatronic control system would have a trained ML model that would allow a user with some motor disability to perform some motor activities, such as opening and closing the right fist, left fist, or both fists, through the classification of EEG signals. Figure 10 depicts a conceptual diagram of the prospective mechatronic control system. We could consider this model the first step in developing intelligent prostheses that integrate the system's several components. The characteristics to be developed in the future are lower cost, smaller size, portability, low power consumption, and reliable communication with the BCI.
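The translation step of such a control system can be pictured as a lookup from the predicted class to an actuator command. The sketch below is purely illustrative: the command names, the fallback behavior, and the `classify` callable are hypothetical, since the text does not define a command set or a software interface for the prosthesis.

```python
from typing import Callable, Sequence

# Hypothetical command names; the paper does not define a command set.
COMMANDS = {
    "Relaxation": "HOLD",
    "Right hand": "CLOSE_RIGHT_FIST",
    "Left hand": "CLOSE_LEFT_FIST",
    "Both fists": "CLOSE_BOTH_FISTS",
    "Feet": "NO_OP",
}

def decode_command(feature_vector: Sequence[float],
                   classify: Callable[[Sequence[float]], str]) -> str:
    """Classify one feature vector and translate the class into a command."""
    label = classify(feature_vector)
    # Fall back to a safe command if the classifier returns an unknown label.
    return COMMANDS.get(label, "HOLD")

# Stand-in for a trained model (hypothetical): always predicts "Right hand".
def dummy_classifier(feature_vector: Sequence[float]) -> str:
    return "Right hand"

command = decode_command([0.0] * 15, dummy_classifier)
```

Passing the classifier in as a callable keeps the translation logic independent of the model, so any of the nine trained algorithms could be plugged in.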

Limitations of the Study
One of the drawbacks of this research study is the need for a BCI: users should have short hair, and the BCI must be comfortable for them to wear. Furthermore, the electrodes must be maintained in saline solution. Successful implementation also relies on the BCI battery life. Finally, if the emotional state of the participants is altered, accurate measurements cannot be acquired.

Conclusions
In this study, a methodology for classifying motor movements by processing the EEG signals of 30 users is presented. The classification of the EEG signals was related to left hand, right hand, both fists, feet, and relaxation movements. As a result of EEG