Education 4.0: Teaching the Basis of Motor Imagery Classiﬁcation Algorithms for Brain-Computer Interfaces

Abstract: Education 4.0 seeks to prepare future scientists and engineers not only by granting them knowledge and skills but also by giving them the ability to apply them to solve real-life problems through the implementation of disruptive technologies. As a consequence, there is a growing demand for educational material that introduces science and engineering students to technologies such as Artificial Intelligence (AI) and Brain–Computer Interfaces (BCI). Thus, our contribution towards the development of this material is to create a test bench for BCI, giving the basis of, and analyzing, how brain signals can be discriminated. This is shown using different AI methods: Fisher Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), Artificial Neural Networks (ANN), Restricted Boltzmann Machines (RBM) and Self-Organizing Maps (SOM), allowing students to see how input changes alter their performance. These tests were done on a two-class Motor Imagery database. First, a large frequency band was used without filtering out eye movement; second, the band was reduced and the eye movement was filtered out. The accuracy was analyzed, obtaining values around 70∼80% for all methods, excluding SVM and SOM mapping. Accuracy and mapping differentiability increased for some subjects in the second scenario (70∼85%), meaning that either the band with their most significant information lies within that limited space or the contamination due to eye movement was better mitigated by the regression method. This can be translated to saying that these methods work better over limited spaces. The outcome of this work is useful for showing future scientists and engineers how BCI experiments are conducted while teaching them the basics of some AI techniques that can be used in this and several other experiments that can be carried out in the framework of Education 4.0.


Introduction
Nowadays, new technologies are evolving at an exponential pace, and the consequential technological advancements achieved through them are blurring the lines between the physical, digital and biological worlds [1]. These advancements constitute the basis of the fourth industrial revolution (also called Industry 4.0), which is principally constituted of progress in the areas of artificial intelligence (AI), robotics, nanotechnology, quantum computing, energy storage systems and the Internet of Things (IoT) [2]. As Industry 4.0 continues changing the world, new challenges arise in different branches of society, one of them being education; thus, Education 4.0 comes into existence.
In general, science and engineering education needs and learning and teaching methods are continuously and rapidly changing in order to adapt to the incoming innovation challenges caused by the digital transformation of industries. Therefore, one of the main objectives of Education 4.0 is to generate updated curricula at the undergraduate level that allow students to develop technological progress and knowledge that, in the future, can be used for the welfare of society. Learning scenarios based on practical projects are some of the pillars that constitute the basis of Education 4.0 [30]. Hence, a way to introduce this paradigm to future engineers and scientists is to use practical approaches that help them acquire new skills, learn how theory and practice are linked, understand how to correctly structure and test hypotheses, develop problem-solving techniques or simply understand how to work with new equipment and how to gather, manipulate and/or interpret data [31].
One way to achieve this is to provide students the option to learn over a testing bench or workbench, in which they can try out different protocols and verify their correctness without trying them on a human. This would help them acquire knowledge about the development of experiments, as well as how to manipulate data and understand results. It is very important to remark that the learning process over a workbench must be carried out over a similar context to the real subject to learn [32] and that it must be done using technologies that are similar to the ones that would be used for a real-life application [33].
To develop this workbench, it is important to understand how science and technology are usually taught. In general, the aim of education in science and technology is to inform people who live in a world with high dependency on technology. It is important to notice that science cannot be taught disjointed from the world because of the many relationships between science and society, especially through the countless applications of science and technology [34,35]. Thus, it is of high importance for future scientists and engineers to learn science and technology based on their own experience and their knowledge about the world and their surroundings [36]. This translates into learning through practical approaches over things that are related to them as individuals.
Having said that, the main objective of this study is to serve as educational material for science and engineering students and teachers who are dabbling in the Education 4.0 paradigm. Ultimately, this will help students acquire expertise on a disruptive and transdisciplinary technology, such as BCI, while developing computational skills and adaptive, sense-making thinking. In order to achieve this objective, this work first explains the basics behind Brain-Computer Interfaces and five different artificial intelligence algorithms: Kohonen Self-Organizing Maps (SOM), Artificial Neural Networks (ANN), Linear Discriminant Analysis (LDA), Support Vector Machines (SVM) and Restricted Boltzmann Machines (RBM). Furthermore, for this work to be fully in line with the Education 4.0 paradigm, we present a test bench for students to learn the applicability of the previous algorithms towards BCI and how to interpret the outcomes of the given experimentation. The proposed test bench in this work consists of a two-class Motor Imagery database obtained by [37], which includes three different bipolar EEG recordings and three monopolar EOG recordings.
This article goes first through a review of Brain-Computer Interfaces with BCI control paradigms, signal processing, including signal acquisition and feature extraction, and Pattern Recognition methods. In the latter, an introduction to AI techniques is given, exploring two linear and two neural network classifiers and one more neural network that creates an internal representation of the signal. Furthermore, a bibliographic comparison is conducted to cover their corresponding advantages and disadvantages. Afterward, these algorithms are tested over a BCI database to show and compare their potential and performance.

Brain-Computer Interfaces
Among the main objectives of technological progress of Industry 4.0 is the intention of searching and implementing new ways of communication, interaction and remote control of devices. Thus, including BCI-AI teaching into the Education 4.0 paradigm is one way to introduce students to this type of technologies.
In general, BCI can be decomposed into four steps. The first one is signal acquisition, which requires an understanding of the intrinsic properties of the signals, what specific signals are to be recorded, where they are going to be captured and the sensors to be used (easy or hard to attach). It is then followed by applying filtering and/or transformations that unmask the intrinsic information within the signals and enhance their patterns and properties with some initial discrimination. Later, these signals are classified to infer the user's intention, which is normally done using a machine learning algorithm. Finally, the resulting pattern is translated into control signals for device manipulation. These steps are shown in Figure 1 [38]. Figure 1. Basic components of a BCI. The image illustrates the map between the input and output through the translating algorithm. Signals are acquired by electrodes and then translated into a control signal for an external device (e.g., wheelchair, neuro-prosthesis or exoskeleton) using processing steps.

BCI Control Paradigms
Feature extraction, in the case of BCI, unfolds the brain-signal characteristics from nonessential material and presents them in a more meaningful form, manageable by either humans or computers. Hence, it is important to establish which commands will be used for control and which features best represent the signal to be analyzed.
BCI control paradigms depend on choosing the feature and the type of signal used as a pattern for the BCI. There are mainly two types of EEG paradigms used in BCI: Evoked Potentials (EP) [39] and changes in the spontaneous oscillatory EEG activity.

Evoked Potentials
Evoked Potentials (EPs) are changes in the electrical potentials that are locked in time to certain events (i.e., visual or tactile). Normally, these brain signals are averaged over one second to be used as control signals. The main techniques are the P300 (P3) wave of visual evoked potentials and the Steady State Visual Evoked Potential (SSVEP). The P300 was first described by Sutton [40] as a potential alteration occurring around 300 milliseconds after a visual event is presented (Figure 2a). The most used P300 paradigm was described by Farwell and Donchin [41], where a matrix of letters and numbers (or symbols) is presented on a six-by-six grid (Figure 2b) whose rows and columns flash; when a line containing the target symbol blinks, the P300 response appears. Further, P300 can be employed as a lie detector: the response occurs when the subject recognizes a visual stimulus related to the concealed information [42,43].
Similarly, SSVEPs are brain responses to visual stimuli, such as flickering lights, that manifest frequency-locked signals with an increased amplitude of the stimulated frequency located over the occipital lobe. Due to that, they do not require eye movement, and they can be used for people that still have eye acuity but cannot move their eyes [44].

Oscillatory Activity Patterns
These kinds of signals are voluntarily induced by the user, such as hand movements that are associated with a power change or synchronization/desynchronization over certain rhythms. This effect also happens using imagination over body movement. In this case, the desynchronization and synchronization are known as event-related desynchronization (ERD) and event-related synchronization (ERS), respectively, [45].
Normally, they appear after the termination of the event. Unlike EP, these signals do not require locking to a stimulus; hence, they can be used at the user's own pace. The two most common are Motor Imagery and Slow Cortical Potentials. The first are changes that occur with the imagination of motor movement [46][47][48]. Using imagination opens the path of using brain areas that are not normally used for the control of devices. Slow Cortical Potentials are slow voltage changes in some wave patterns over the cortex that can be produced by extensively trained users to select words or pictograms from a computer [49].

Signal Processing
Decoding brain states is problematic since they have a poor signal-to-noise ratio, variability between trials (in different sessions or even on the same session), high dimensionality data, highly location-dependent data, etc. [50]. Thus, for correct decoding, the usage of brain signals requires several steps, starting from signal acquisition (e.g., EEG and ECoG recordings), feature extraction, pattern recognition and, finally, translation into control signals ( Figure 1).

Signal Acquisition
The brain is composed of billions of neurons that communicate using electrical signals. These signals are produced at similar locations between individuals, yet it is not fully understood why they are emitted there and what their intentions are. However, it is still important to know when they are produced and their location, which reflects the normal or abnormal activity of the brain and user intentions.
Many techniques have been developed to record brain activity (e.g., EEG, ECoG, single-neuron recording, PET, fMRI, MEG and FNIR) [14,51]. Despite all of them being able to record brain activity, in this work, we will focus on EEG. The reason behind this is that the other enlisted techniques are either invasive, expensive or have high latency.
EEG recordings are done using electrodes attached to the surface of the scalp, and each electrode measures the potential difference between a reference electrode and itself [51]. Correspondingly, these potentials reflect activity within the brain, and to avoid unwanted signal noise due to poor connection, the electrodes must have good contact with the area of interest. Furthermore, for understanding and repeatability, it is of great interest to know exactly where the electrodes are commonly located. For that reason, the international 10-20 electrode system is used, which consists of making an arc grid over the scalp that starts at specific locations, where the Nasion and Inion are the longitudinal references and the right and left preauricular points are the lateral references (see Figure 3). The corresponding name of each arc crossing represents a location over the brain lobes, which is helpful for the spatial analysis of recorded signals. Furthermore, since EEG signals pass through several layers of muscle, skin and bone, having a correct measure of the brain signal requires a process of amplification and filtering to improve the signal quality.
First, the amplification helps increase the low-signal amplitude (∼10-20 µV), which is not easy to interpret using common displays, recorders or AC/DC converters. Notably, amplifiers must fulfill some requirements such as noise rejection and guarantee equipment and patient protection.
Then, filtering is done to reduce either the environmental noise (e.g., power lines and electrical and/or surrounding medical equipment) or the physiological noise (e.g., muscle activation, eye movement, and/or blinking) [52]. Dealing with environmental noise is usually easier than dealing with physiological noise. Environmental noise can be avoided by removing most of the sources of electromagnetic signals from the recording room and its vicinity. Furthermore, one of the most common techniques is to use a notch filter at 50 or 60 Hz that helps by removing the noise of the electric power lines' artifacts. For physiological noise, one of the most common approaches is to incorporate physiological signals in the recordings and subtract them from the EEG. Other methods include telling the subject to remain still, not blink and hold the gaze during the study; however, this is usually difficult and can introduce even more noise because of the voluntary attention needed to control those body actions.

Feature Extraction Methods
The second component is feature extraction. Once the correct control signal has been selected, it is necessary to find a way to better represent it. BCIs mainly use four kinds of feature extraction methods to represent the signal: temporal methods (i.e., signal amplitude and auto-regressive coefficients), frequency methods (i.e., band power and power spectral densities), time-frequency methods (STFT and wavelets) and others (e.g., coherency, phase synchronization, etc.). The selection of one of these methods depends completely on the desired control command for classification. Thus, depending on the transformation, it is recommended that EEG recordings have a high sampling rate and more than a single electrode for a better signal representation.
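As a minimal example of a frequency-domain feature, the helper below estimates the average band power of a single channel from Welch's power spectral density. The function name and its parameters are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from scipy.signal import welch

def band_power(x, fs, band):
    """Average power of a 1-D signal x inside a frequency band (lo, hi) Hz,
    estimated from Welch's power spectral density."""
    f, psd = welch(x, fs=fs, nperseg=min(len(x), 256))
    lo, hi = band
    mask = (f >= lo) & (f <= hi)
    return psd[mask].mean()
```

For a pure 10 Hz oscillation sampled at 250 Hz, the power in the 8-12 Hz band dominates the power in the 22-30 Hz band, which is the kind of contrast a band-power feature exploits.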

Pattern Recognition
The third component is pattern recognition, which translates the features into a control signal. The main problem in this step of the BCI is that brain signals are highly variable and would be hard, if not impossible, to manually translate into control signals. In light of this problem, the use of Artificial Intelligence (AI) is highly beneficial. Given that there are many techniques and that their applications in science are vast, it is necessary to understand the basics of AI techniques, what each technique can do and how they are developed. Thus, students must see what a real application of these techniques can do, especially in BCI.
In particular, this work focuses on five AI algorithms: Kohonen Self-Organizing Maps (SOM), Artificial Neural Networks (ANN) trained by backpropagation, Linear Discriminant Analysis (LDA), Support Vector Machines (SVM) and Restricted Boltzmann Machines (RBM), as well as their applicability to BCI. These techniques were chosen because each one brings different properties that are interesting to analyze. In the case of SOM, as an unsupervised network that does not require labels, it is capable of creating an internal representation of the system. On the other hand, neural networks facilitate training by using the error to correct their internal representation. Linear Discriminant Analysis is the technique most used for BCI due to its simplicity and adaptability, but with the limitation of working only for binary classification. Furthermore, Support Vector Machines are tested in this work since they are one of the most used classification techniques and have high separability capabilities. Finally, the Restricted Boltzmann Machine algorithm is analyzed as a different technique for BCI that explores and characterizes both the signals and their classes together, creating an internal map of them.

Kohonen Self-Organizing Maps
A Self-Organizing Map (SOM) [53] is an unsupervised neural network that produces a discrete representation of an input space, which is referred to as a map. This algorithm is used as a clustering or dimensionality reduction method and consists of an input layer and a computational layer (Figure 4a) formed by nodes or neurons. Each node has a topological position and a number of weights equal to the number of inputs. The SOM method calculates the Euclidean distance between an entry vector x and the weights w_j of each node, d_j = ||x − w_j|| = (Σ_i (x_i − w_ji)²)^(1/2), and chooses the node with the lowest distance as the best or winning node. This node is referred to as the Best Matching Unit or BMU. Once the BMU is found, the nodes in the neighborhood (i.e., influence area) of the BMU and the BMU itself are selected, and their weights are updated (Figure 4b). The BMU's influence area is calculated as σ(t) = σ_0 e^(−t/τ_σ), with σ_0 as the lattice width at the instant t_0 and τ_σ as the updating constant of σ. After the area is selected, the weights are updated using the equation below: w(t + 1) = w(t) + η(t) Θ(t) (x − w(t)),

Figure 4. (a) SOM layers; (b) SOM area of influence.
where η and Θ represent the learning rate and influence rate at the instant of time t. As training time advances, the learning rate and influence rate diminish their effect by η(t) = η_0 e^(−t/τ_η) and Θ(t) = e^(−d_BMU²/(2σ²(t))), where η_0 is the initial learning rate, τ_η is the update constant of η, and d_BMU is the lattice distance between a node and the BMU. With this learning technique, inputs with similar characteristics will cluster together around a given node, while inputs with different characteristics will cluster apart around other nodes. The steps of this method can be seen in Algorithm 1.

Algorithm 1: SOM
Input network:
    Training set S = {X_1, X_2, ..., X_s}; learning and influence rates α and θ ∈ (0, 1]
Init network:
    Initialize the weights to small random values
Train network:
    Loop while the weights change (w_new ≠ w_old) and iter < max_iterations:
        Update the weights
        Update the learning and influence rates
Output network:
    Weights w
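The steps of Algorithm 1 can be sketched in a few lines of Python. The grid size (10 × 10) and decay constants below mirror the values reported in the Method section, but the implementation itself is a didactic sketch, not the authors' code; here the decay index t counts epochs.

```python
import numpy as np

def train_som(X, rows=10, cols=10, epochs=50,
              eta0=0.2, sigma0=10.0, tau_eta=100.0, tau_sigma=4.0, seed=0):
    """Minimal SOM: X has shape (n_samples, n_features).
    Returns weights of shape (rows*cols, n_features)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, size=(rows * cols, X.shape[1]))
    # (row, col) lattice position of every unit
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(epochs):
        eta = eta0 * np.exp(-t / tau_eta)        # learning-rate decay
        sigma = sigma0 * np.exp(-t / tau_sigma)  # neighbourhood-width decay
        for x in X:
            # Best Matching Unit: node with the smallest Euclidean distance
            bmu = int(np.argmin(np.linalg.norm(W - x, axis=1)))
            # Influence of the BMU over its lattice neighbourhood
            d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            theta = np.exp(-d2 / (2.0 * sigma ** 2))
            W += eta * theta[:, None] * (x - W)
    return W
```

Trained on two well-separated clusters, the map assigns each cluster its own best-matching unit, which is the kind of visual class separation examined later for the MI data.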

Fisher's Linear Discriminant Analysis
Linear Discriminant Analysis (LDA), also known as Fisher's LDA, uses a linear hyperplane to separate the data representing each of the two classes (see Figure 5). The hyperplane is obtained by projecting the high-dimensional data onto a line. The objective of this projection is to maximize the distance between the means of the two classes while minimizing the variance within each class. This defines the Fisher criterion, which is maximized over all linear projections w: J(w) = (μ̃_1 − μ̃_2)² / (s̃_1² + s̃_2²), where μ̃_i represents the mean of the projections of classes 1 and 2 (μ̃_i = w^T µ_i), and s̃_i² represents the variance of these projections, s̃_i² = Σ_y (y − μ̃_i)², where y = w^T x are the projected samples. Based on these equalities, we can rewrite Fisher's criterion as a function of w in the following way: J(w) = (w^T S_B w) / (w^T S_W w), where S_B and S_W measure the separation between the means of both classes and the within-class scattering, respectively. Given the previous equation, we can find its maximum by solving the generalized eigenvalue problem S_B w = λ S_W w, i.e., S_W^(−1) S_B w = λ w. Solving this problem will result in a collection of eigenvectors w and their corresponding eigenvalues λ. Then, these eigenvectors must be sorted according to their eigenvalues from largest to smallest, and finally, a set of k eigenvectors is chosen to create a weight matrix W, which is the representation of the new space onto which the data are going to be projected. Algorithm 2 gives the basic steps for LDA.

Algorithm 2: LDA
Input network:
    Training set S
Calculate the means µ_i
Calculate S_B and S_W
Get the eigenvectors and eigenvalues: (e_1, e_2, ..., e_n), (λ_1, λ_2, ..., λ_n)
Obtain the matrix S_x
Sort the eigenvectors and choose the k ones with the largest eigenvalues
Form a matrix W (n × k)
Output network:
    Return matrix W
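For the two-class case, the steps above reduce to a closed form, since the Fisher direction is proportional to S_W^(−1)(µ_1 − µ_2). The sketch below is a didactic NumPy implementation with invented helper names, not the authors' code.

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Fisher discriminant direction for two classes.
    X1, X2: (n_i, d) arrays of class samples. Returns the unit vector w
    maximizing J(w) = (w^T S_B w) / (w^T S_W w)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W (sum of both class scatters)
    Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    # Two-class solution: w proportional to Sw^{-1} (mu1 - mu2)
    w = np.linalg.solve(Sw, mu1 - mu2)
    return w / np.linalg.norm(w)

def lda_classify(x, w, X1, X2):
    """Assign x to the class whose projected mean is closer."""
    p = x @ w
    m1, m2 = X1.mean(axis=0) @ w, X2.mean(axis=0) @ w
    return 1 if abs(p - m1) < abs(p - m2) else 2
```

On two Gaussian clouds separated along one axis, the recovered direction is dominated by that axis and the classifier labels the class centers correctly.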

Support Vector Machines
Support Vector Machines (SVM) are supervised models that separate two classes using a discriminative hyperplane. SVM searches for the hyperplane that maximizes the separation margins of the system, i.e., the distances between the hyperplane and the training points of each class. In Figure 6a, these distances are shown as d_1 and d_2. As can be seen in the figure, the hyperplanes divide the input data into two different regions, one considered positive (y_i = 1 for H_1) and the other negative (y_i = −1 for H_2). Thus, the hyperplanes, as shown in Figure 6a, are defined as H_1: w^T x_i + b = +1 and H_2: w^T x_i + b = −1. The main objective is for the classifier to have a margin as large as possible, i.e., to maximize the distance between both hyperplanes, defined as d = 2/||w||. This is the same as minimizing the function (1/2) w^T w subject to the condition y_i (w^T x_i + b) ≥ 1. Although SVM is a linear classifier, it can be extended to the non-linear case using the 'kernel trick', which maps the data into a different space (Figure 6b) where they can be linearly separated. Furthermore, SVM is a binary classifier but can easily be converted into a multi-class classifier using the one-vs.-rest technique, where a classifier is made for each class and discriminates it against the rest of the classes. The winning class is the one with the highest final confidence value. The steps for SVM can be seen in Algorithm 3.
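A brief illustration of the kernel trick, assuming scikit-learn is available: the toy data below (two concentric rings) cannot be split by any linear hyperplane, yet an RBF-kernel SVM separates it. The data and parameter values are invented for the example.

```python
import numpy as np
from sklearn.svm import SVC

# Toy radially separable data: an inner disc (class 0) and an outer ring (class 1)
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100),   # class 0
                        rng.uniform(2.0, 3.0, 100)])  # class 1
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = np.r_[np.zeros(100), np.ones(100)]

# RBF kernel maps the data into a space where a hyperplane can separate them
clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # illustrative parameters
clf.fit(X, y)
acc = clf.score(X, y)
```

With a linear kernel this problem is unsolvable; with the RBF kernel the training accuracy is essentially perfect, which is the point of the mapping in Figure 6b.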

Backpropagation
Backpropagation (BP) is a common technique for training Artificial Neural Networks (ANN) in which, in a few words, the error is propagated backwards so that the network can learn by itself and adapt depending on previous mistakes [54] (Figure 7). The main objective of the BP algorithm is to minimize the error function in a weight space using the gradient descent method. The combination of weights that minimizes the error function is considered the solution to the learning problem. To use the gradient descent method, we must first guarantee that the error function and the activation function are continuous and differentiable. One of the activation functions usually implemented for BP is the sigmoid function, S_c(x) = 1/(1 + e^(−x)), whose derivative exists and is continuous, S_c'(x) = S_c(x)(1 − S_c(x)). The activation function of a neuron computes the sum of the inputs x_1, ..., x_n times the weights w_1, ..., w_n plus the bias θ of that particular neuron. In the particular case of the sigmoid function: o = S_c(Σ_n w_n x_n + θ), (7) where the output of the system is composed of the outputs of each neuron, (o_1, ..., o_m). The BP method seeks to minimize the error between the generated output o_i and the original output, i.e., the target t_i. This error can be represented through the mean sum squared loss function: E = (1/2) Σ_i (o_i − t_i)². (8) To minimize the error, the weights need to be corrected using gradient descent, where each weight is updated using ∆w_n = −γ ∂E/∂w_n, with γ being the learning constant. The weights are updated iteratively until ∇E ≈ 0 using w_n^new = w_n^old + ∆w_n. With this, all weights are updated with the intent of error minimization. Once the error is minimized, the network can be used on unseen data to check its performance. The resulting method can be described in Algorithm 4.

Algorithm 4: Backpropagation
Input network:
    Training set S = {X_1, X_2, ..., X_s}; learning and influence rates α and θ ∈ (0, 1]
Init network:
    Initialize the weights w_n and bias b to small random values
Train network:
    Loop while the weights change (w_new ≠ w_old) and iter < max_iterations:
        repeat
            Choose a random input X_i
            Forward propagation:
                for all MLP layers do
                    Apply Equation (7) to each layer up to the output layer
                end for
            Backward propagation:
                Calculate the quadratic error according to Equation (8)
                for all MLP layers do
                    Calculate each of the deltas using ∆w_n = −γ ∂E/∂w_n
                end for
            Update the weights using w_n^new = w_n^old + ∆w_n
        until w_n converges
Output:
    Weights w_n
Use the trained network for classification
Results: Activity labels A of the unlabeled data
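Algorithm 4 can be sketched for a one-hidden-layer network learning the classic XOR problem, a task that no single-layer network can solve. The layer sizes, learning constant γ and epoch count below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation S_c(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, T, hidden=8, gamma=0.5, epochs=20000, seed=0):
    """One-hidden-layer MLP trained by full-batch backpropagation."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, T.shape[1])); b2 = np.zeros(T.shape[1])
    for _ in range(epochs):
        # Forward pass through both layers (Equation (7) per layer)
        h = sigmoid(X @ W1 + b1)
        o = sigmoid(h @ W2 + b2)
        # Backward pass: deltas of E = 1/2 * sum((o - t)^2),
        # using S_c'(x) = S_c(x)(1 - S_c(x))
        d_o = (o - T) * o * (1.0 - o)
        d_h = (d_o @ W2.T) * h * (1.0 - h)
        # Gradient-descent updates: delta_w = -gamma * dE/dw
        W2 -= gamma * (h.T @ d_o); b2 -= gamma * d_o.sum(axis=0)
        W1 -= gamma * (X.T @ d_h); b1 -= gamma * d_h.sum(axis=0)
    return W1, b1, W2, b2

def predict(X, W1, b1, W2, b2):
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
```

After training, the mean squared error on the four XOR patterns drops close to zero, showing that the hidden layer has learned a non-linear decision boundary.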

Restricted Boltzmann Machines
A special type of ANN is the one developed by Hinton [55], known as the Restricted Boltzmann Machine (RBM). RBMs are two-layer neural networks of stochastic units, divided into visible units v = (v_1, · · · , v_i) and hidden units h = (h_1, · · · , h_j), which have symmetrically connected weights (see Figure 8). The visible units represent the data, while the hidden units are known as feature extractors. RBMs aim to model dependencies over the visible variables. The probability p(v, h; Θ) ∝ e^(−E(v,h;Θ)) is known as the Boltzmann distribution, which has an energy function described as E(v, h; Θ) = −Σ_i b_i v_i − Σ_j c_j h_j − Σ_{i,j} v_i W_ij h_j, with W as the symmetric weights, b and c as the biases of the visible and hidden units, respectively, and Θ = (W, b, c). The two conditional distributions over the variables, i.e., hidden given the visible and visible given the hidden, are given by p(h_j = 1 | v) = σ(c_j + Σ_i W_ij v_i) and p(v_i = 1 | h) = σ(b_i + Σ_j W_ij h_j), where σ represents the activation function. Since the hidden variables cannot be observed, we need an algorithm that improves the RBM representation of the system. This algorithm is called Contrastive Divergence (CD) [56], which allows fitting the probability p(v) to a certain set of observations (e.g., EEG signals). The pseudocode for RBM can be seen in Algorithm 5.
Algorithm 5: RBM
% Notation: x ← b means x is set to value b; x ∼ p means x is sampled from p
Input network:
    Training pair v = (x_i, y_i); learning rate α
Init network:
Output network:
    Return weights and biases
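The CD procedure can be sketched for the simplest binary-binary RBM with a single Gibbs step (CD-1). This is a didactic sketch with invented function names, not the Gaussian/softmax variant used later in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(V, n_hidden=16, lr=0.1, epochs=500, seed=0):
    """Binary-binary RBM trained with one-step Contrastive Divergence.
    V: (n_samples, n_visible) binary data."""
    rng = np.random.default_rng(seed)
    n_vis = V.shape[1]
    W = rng.normal(0.0, 0.01, (n_vis, n_hidden))
    b = np.zeros(n_vis)      # visible biases
    c = np.zeros(n_hidden)   # hidden biases
    for _ in range(epochs):
        # Positive phase: p(h=1|v) on the data, then sample h
        ph = sigmoid(V @ W + c)
        h = (rng.random(ph.shape) < ph).astype(float)
        # Negative phase: one Gibbs step down to v and back up (CD-1)
        pv = sigmoid(h @ W.T + b)
        v1 = (rng.random(pv.shape) < pv).astype(float)
        ph1 = sigmoid(v1 @ W + c)
        # Updates from the difference of data and model correlations
        W += lr * (V.T @ ph - v1.T @ ph1) / len(V)
        b += lr * (V - v1).mean(axis=0)
        c += lr * (ph - ph1).mean(axis=0)
    return W, b, c

def reconstruct(V, W, b, c):
    """Deterministic one-step reconstruction through the hidden layer."""
    return sigmoid(sigmoid(V @ W + c) @ W.T + b)
```

Trained on a small set of repeated binary patterns, the reconstruction error drops well below the 0.5 obtained with untrained (zero) parameters, showing that the hidden units have captured the dependencies among the visible variables.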

Advantages and Disadvantages of the Methods
We intend to test the efficacy of the proposed methods for BCI but, before doing so, it is convenient to clarify the advantages and disadvantages of each method. Table 1 lists the main characteristics of each of them.

Accuracy and Cross Validation
In general, it is important to know how well a given method performs on a specific task; thus, it is important to calculate its accuracy. To do so, the Mean Square Error (MSE) was used, as shown in Equation (14), MSE = (1/N) Σ_{i=1}^{N} (t_i − y_i)², with t_i and y_i as the observed and predicted outputs.
Another important measure is how the classification method behaves when dealing with independent data, i.e., how general the method is. It is important to check that the method does not overfit, which means that it obtains a perfect score when dealing with training data but has a poor performance when it is exposed to unseen data. One way to overcome this problem is to observe the performance of the classifier over a training dataset and then verify it using a test dataset; this is the basic idea behind a technique called cross validation. However, we still have the problem that the behavior of the system may depend heavily on which data points are used for training and which ones are used for testing. Thus, the algorithm may yield different results depending on how the data were divided into the training and testing datasets.
One of the most used methods to overcome this problem is known as K-fold Cross-Validation. This technique is based on splitting the dataset into k smaller sets and repeating the training and testing k times. Each time, the algorithm uses a different fold as the testing dataset, while the other k − 1 folds are used for training. Then, the validation results are averaged to obtain the overall performance of the algorithm (see Figure 9). This testing procedure allows us to describe how well the classifier performs on different datasets.
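The splitting scheme above can be sketched as a small generator; the function name and the seeded shuffle are illustrative choices.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index arrays for K-fold cross-validation:
    each of the k folds serves once as the test set while the
    remaining k - 1 folds form the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

Every sample appears in exactly one test fold, and no fold shares indices between its training and testing parts, which is what makes the averaged score an honest estimate of generalization.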

Method
To test the previously described techniques, we used the dataset from [37] (a link to this dataset can be found in Appendix C), consisting of EEG and EOG recordings from ten naive right-handed subjects (six male and four female) with an average age of 24.7 ± 3.3 years. Furthermore, the participants had normal or corrected-to-normal vision during the experiments. The gathered data consisted of three bipolar EEG recordings (C3, Cz and C4) with a sampling frequency of 250 Hz and the electrode Fz as the EEG ground, as shown in Figure 10a. The recorded signals had a dynamic range of ±100 µV, which were analog bandpass filtered (0.5-100 Hz) and notch filtered (50 Hz). At the same time, EOG data were recorded using three monopolar electrodes (Figure 10b) with a dynamic voltage range of ±1 mV. Each subject participated in five sessions, two without feedback and three with feedback. At the beginning of each session, a 5-minute recording of continuous eye behavior was made to estimate the EOG artifact correction coefficients. These recordings were divided as follows: eyes open during 2 min, eyes closed during 1 min and eyes moving during 1 min (see Appendix A).
The sessions without feedback were done using a cue-based paradigm (Figure 11a), in which each subject had to perform motor imagery (MI) depending on the visual cue shown in the monitor. Each trial started with a fixated cross and an additional short warning tone. Then, after some seconds, the visual cue consisting of an arrow pointing either to the right or to the left appeared for 1.25 s. Afterward, the subject had to maintain the corresponding MI for a period of 4 s. In between trials, a short break of a random period between 1.5 and 2.5 s was given to avoid adaptation.
The three feedback sessions consisted of four runs with twenty trials for each type of motor imagery. These sessions were carried out using smiley feedback (see Figure 11b), whose initial state was centered and gray-colored. At second 2, a warning tone was emitted, which preceded a cue that lasted from second 3 to second 7.5. According to the given cue, subjects had to move the smiley to the left or right by imagining hand movements in those directions. The smiley changed color from gray to green or red, and the curvature of its mouth from happy to sad, depending on whether the direction was correct or incorrect according to the cue.

Data Processing
Two different approaches were designed for the data processing step. The first one consisted of testing the performance of the classifiers with a large range of frequency bands and without any EOG removal. To do this, the data were transformed into the frequency domain over a wide band, the range in which the changes in amplitude occur. Then, the second approach was to reduce the frequency bands to two ranges, 8-12 Hz and 22-30 Hz, and the EOG was removed using the previously proposed regression.
A specific SOM was trained for 50 epochs for each of the nine subjects to obtain an internal representation of both classes, in order to observe whether they could be easily discriminated. Each SOM had 100 units distributed in a 10 × 10 matrix (see Figure 12), an initial learning rate η_0 = 0.2, an initial lattice width σ_0 = 10 and updating constants τ_η = 100 and τ_σ = 4.
Being an unsupervised method, the SOM used the training dataset to tune its weights, and the testing dataset was used to observe the final internal representation that it could generate. In this case, the first three sessions were used as training data and the remaining two as testing data. The other techniques were supervised methods, where the testing dataset was used to check the final classification accuracy of each method. K-fold cross-validation with k = 5 was done to better evaluate the results of these algorithms. In other words, the data were split so that one of the recorded sessions was considered as testing data and the remaining sessions as training data. This process was repeated five times, and then the accuracy was averaged.
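The session-wise cross-validation just described amounts to a leave-one-session-out split. A sketch of it, with a hypothetical layout of five sessions of twenty trials each (the exact trial counts per session are an assumption here):

```python
import numpy as np

def session_folds(session_ids):
    """Leave-one-session-out folds (k = 5 here): each recorded session is
    used once as the testing set and the rest as training data."""
    for s in np.unique(session_ids):
        yield np.where(session_ids != s)[0], np.where(session_ids == s)[0]

# Hypothetical layout: 5 sessions with 20 trials each.
session_ids = np.repeat(np.arange(5), 20)
folds = list(session_folds(session_ids))
for train_idx, test_idx in folds:
    # ... fit the classifier on train_idx, score it on test_idx,
    # then average the five accuracies ...
    pass
```

This keeps all trials of one session together, so the reported accuracy reflects generalization across recording sessions rather than across shuffled trials.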
As LDA calculates the mean and scatter matrices, it does not require any specific training parameter, so the process is as straightforward as shown in Section 4.2. For the SVMs, since this is a binary classification problem, there was no need to use any expansion method. However, the SVMs were trained on a radial basis using the kernel function K(x_i, x_j) = exp(−(1/(2σ^2))||x_i − x_j||^2), which is one of the most common kernels used for BCI. The box constraint parameter C = 10^−2 was used since it gave the best overall results.
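The radial-basis kernel quoted above is easy to compute explicitly. A minimal NumPy sketch (the function name and σ = 1 are assumptions for illustration):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Radial-basis kernel K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2)),
    the kernel quoted in the text for the SVM."""
    # Squared Euclidean distances between all pairs of rows.
    d2 = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [3.0, 4.0]])
K = rbf_kernel(X, X, sigma=1.0)
# K[i, i] = 1 for every point, and K decays with squared distance.
```

In a full pipeline one would typically not compute the Gram matrix by hand but use a library solver, e.g. scikit-learn's `SVC(kernel='rbf', C=0.01)` with the small box constraint the text reports.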
A 1000-neuron Artificial Neural Network was trained using backpropagation, with a learning rate of 0.05 and a momentum of 0.01. The weights were initialized from a normal distribution N(0, 0.01^2). The ANN was trained over 100 sweeps (or epochs) with batches of 100 randomly selected EEG trials.
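A stripped-down version of this training loop can be sketched as below. The defaults mirror the quoted hyper-parameters, but the network here is a simplification (one hidden layer, sigmoid units, squared-error loss, no biases), and the toy call deliberately uses a larger initialization, learning rate and epoch count than the paper's values so the small example converges quickly; all of those deviations are assumptions of this sketch.

```python
import numpy as np

def train_ann(X, y, hidden=1000, lr=0.05, momentum=0.01, epochs=100,
              batch=100, init=0.01, seed=0):
    """One-hidden-layer network trained with backpropagation and momentum.
    Defaults follow the text: 1000 neurons, lr = 0.05, momentum = 0.01,
    N(0, 0.01^2) initialization, 100 epochs, batches of 100 trials."""
    rng = np.random.default_rng(seed)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    W1 = rng.normal(0, init, (X.shape[1], hidden))
    W2 = rng.normal(0, init, (hidden, 1))
    V1, V2 = np.zeros_like(W1), np.zeros_like(W2)
    for _ in range(epochs):
        idx = rng.choice(len(X), size=min(batch, len(X)), replace=False)
        xb, yb = X[idx], y[idx, None]
        h = sig(xb @ W1)                      # forward pass
        out = sig(h @ W2)
        d_out = (out - yb) * out * (1 - out)  # squared-error gradient
        d_h = (d_out @ W2.T) * h * (1 - h)    # backpropagated to layer 1
        V2 = momentum * V2 - lr * (h.T @ d_out) / len(xb)
        V1 = momentum * V1 - lr * (xb.T @ d_h) / len(xb)
        W2 += V2
        W1 += V1
    return lambda Xn: sig(sig(Xn @ W1) @ W2).ravel()

# Toy separable problem (larger init/lr/epochs than the paper, see above).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (100, 4)), rng.normal(-2, 1, (100, 4))])
y = np.r_[np.ones(100), np.zeros(100)]
predict = train_ann(X, y, hidden=20, lr=0.5, epochs=500, batch=200, init=0.5)
acc = ((predict(X) > 0.5) == y).mean()
```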
Finally, the RBM initial training parameters (weights, biases and rates) were obtained from [56] and adapted after some preliminary analysis. The RBM was trained over 100 epochs, each comprising Contrastive Divergence updates derived from 10 Gibbs sampling iterations (CD-10). The training datasets were composed of mini-batches of 100 randomly selected EEG trials. The weights were drawn from a normal distribution N(0, 0.1^2) for the Gaussian-Binary connections and N(0, 0.01^2) for the Softmax-Binary connections, with each bias initialized at zero. The weights and biases were updated with a learning rate of 10^−3 and a momentum of 0.5, incremented by 0.1 at 40% and 80% of the learning process. The step-up value of 0.1 was selected because higher increments made the learning unstable. A cost value of 2 · 10^−4 was selected since it facilitated the learning process of CD by increasing the mixing rate of the Markov chain.
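The core of this procedure, a CD-k update, can be sketched for the simpler binary-binary case. Note the assumptions: the paper's RBM uses Gaussian-Binary and Softmax-Binary units, momentum and a weight cost, none of which appear in this toy, and the toy learning rate is much larger than the paper's 10^−3 so the demonstration converges in a few hundred updates.

```python
import numpy as np

def cd_k_update(W, a, b, v0, k, lr, rng):
    """One Contrastive Divergence update (CD-k) for a binary RBM.
    W: visible x hidden weights, a: visible biases, b: hidden biases."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    # Positive phase: hidden probabilities given the data.
    ph0 = sig(v0 @ W + b)
    h = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: k steps of Gibbs sampling.
    for _ in range(k):
        pv = sig(h @ W.T + a)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sig(v @ W + b)
        h = (rng.random(ph.shape) < ph).astype(float)
    # Update: data statistics minus model statistics.
    W += lr * (v0.T @ ph0 - v.T @ ph) / len(v0)
    a += lr * (v0 - v).mean(axis=0)
    b += lr * (ph0 - ph).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(0)
data = np.tile([1.0, 1.0, 0.0, 0.0], (100, 1))   # one repeated pattern
W = rng.normal(0, 0.01, (4, 8))
a, b = np.zeros(4), np.zeros(8)
for _ in range(300):
    W, a, b = cd_k_update(W, a, b, data, k=10, lr=0.1, rng=rng)
# After training, the visible biases favour the on-bits of the pattern.
```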
All the algorithms were implemented using Matlab™ on a Windows™ 7 Professional 64-bit operating system. The computer used to run the algorithms had an Intel® CPU E5-2618L v3 @ 2.30 GHz with 16 cores and 24 GB of RAM.

Results
For the first part of the results, the classification accuracy (or mapping accuracy in the case of SOM) was obtained for each algorithm using a frequency band of (8-30) Hz and no regression filtering over the EOG. Then, the band was reduced for all subjects to (8-12) Hz and (22-30) Hz using an average of the frequencies obtained in [37], and the EOG artifacts were also reduced through the regression procedure.

Frequency Band of (8-30) Hz and no EOG Filtering
First, for a better comparison, the frequency response of every subject is shown in Figure 13. It is then followed by the results of SOM shown in Figure 14, which shows the final internal representation of the two classes inside the SOM network for each subject. Furthermore, Table 2 shows the winner neuron of the SOM for both classes for each subject. Table 3 shows the resultant training and testing accuracy of the four remaining methods.

Frequency Bands of (8-12) Hz and (22-30) Hz, with EOG Reduction
As in the previous section, first, the average frequency response of each subject is shown in Figure 16. Then, the final internal representations obtained through the SOM are shown in Figure 17, while Table 4 shows the position of the winner SOM neurons for both classes for each subject, using the frequency bands of (8-12) Hz and (22-30) Hz and with regression filtering over the EOG. Table 5 lists the results for the training and testing accuracy using LDA, SVM, BP and RBM with the reduced frequency bands and with the EOG reduction regression method; its testing accuracies (%) for Subjects 1-9 were:
LDA: 64, 52, 49, 79, 66, 54, 62, 71, 62
SVM: 51, 50, 50, 75, 63, 51, 52, 63, 56
BP: 66, 52, 52, 80, 66, 55, 61, 71, 62
RBM: 68, 53, 52, 80, 66, 55, 63
Finally, Figure 18 shows the training time (in seconds) required by each of the five methods with the selected frequency bands ((8-12) Hz and (22-30) Hz) and EOG correction.

Frequency band of (8-30) Hz and no EOG filtering:
The results showed that for Subjects 4, 7 and 9, the SOM mapping has a better separation capacity (see Figure 14, Subjects 4, 7 and 9) than for the rest of the subjects. Likewise, this effect is also represented in Table 2 (using Figure 12 as reference), where the winner neurons for the two classes are far apart from each other.
Correspondingly, a similar effect occurs for the classification methods (Table 3), where the same subjects, plus Subjects 1, 5 and 8, showed training and testing accuracies higher than chance level (≥60-80%) for LDA, BP and RBM. However, for the other subjects, it is not easy to discriminate between the two classes, as illustrated in Figure 13, where there is no observable difference in the EEG frequency response among these subjects. It is important to notice that the SVM method did not show any discrimination capacity for any subject when applied to the testing dataset. This might be due to the kernel being too general and not working well on high-dimensional signals.
Frequency bands of (8-12) Hz and (22-30) Hz and EOG reduction: Using the band reduction and the EOG regression method, the SOM results now showed that the best separation capacity was obtained for Subjects 1, 3, 6, 8 and 9.
In the case of the classification techniques, there was a slight improvement in the accuracy percentage for Subjects 1, 4 and 8 (Table 5), while there was no significant improvement for Subjects 7 and 9, which already had good results on the (8-30) Hz band. In this case, it is assumed that some hidden attributes of Subjects 1, 4 and 8 were uncovered by the band reduction or the filtering. However, these procedures did not necessarily help the other subjects, for whom the selected bands may not be optimal or the EOG contamination was not an important factor in their classification. In addition, there was no improvement in accuracy for the subjects that already performed poorly; in these cases, even the limited band and the noise reduction technique could not help to uncover whether there was any difference between the classes.
Lastly, for the SVM method, the accuracy was low for every subject except Subjects 4, 5 and 8. In general, the accuracy of SVM highly depends on finding the correct kernel to map the data, meaning that the initial parameters introduced for mapping into a higher dimension were not optimal for this database. Furthermore, SVM usually has problems discriminating when the same parameters are used for every individual, which means that it may require a subject-specific setup.
In the case of processing times, the training time is reduced when using the limited frequency bands and the regression method. This is a natural consequence of the reduced dimensionality of the data. Furthermore, when comparing the processing times of the presented methods, it can be observed that SOM has the longest, followed by SVM. The former could be due to inserting the data one by one to adjust the map, which could be improved by using a batch method. On the other hand, SVM needs to solve a quadratic optimization problem with a large dataset and a small box constraint, which limits the convergence speed of the algorithm. Moreover, it can be observed that the BP method is slower than the RBM method because the ANN had many more neurons than the RBM (1000 neurons for BP versus 64 neurons for RBM). However, other numbers of neurons for the BP did not give as high an accuracy as those obtained.
Finally, although LDA is indeed by far the fastest of all the presented algorithms, which is one of the reasons why it is one of the most used methods for BCI, it has the problem of not being easily adaptable to a high number of classes, thus needing different methods for multi-class problems.

Conclusions
With the aim of driving the development of competencies for future engineers and scientists, schools require curricula that are in line with the technological progress and demands of Industry 4.0. Consequently, Education 4.0 is searching for new ways to introduce students to emerging technologies, such as artificial intelligence, and to how they are applied to real-life situations.
Accordingly, Education 4.0 is responsible for helping future professionals become familiar with the area of artificial intelligence while providing them with the opportunity to test the acquired knowledge by applying it to real-life scenarios. Therefore, within the Education 4.0 framework, teachers and students need updated educational material that supports them on their path towards teaching and learning about the technologies of the incoming industrial revolution. Hence, the main objective of this work was to provide updated teaching/learning material that introduces students to a cutting-edge technology, such as BCI, which is used in several real-life applications, while providing them with the basic knowledge of five different AI techniques and how they can be applied to the basics of BCI experimentation.
These different AI techniques were presented through a brief review of the methods and their corresponding pseudo-codes, together with the results of their implementation on BCI, so that students become aware of the problems and possible outcomes of these experiments. This implementation was done over a test bench that consists of EEG and EOG recordings obtained from [37]. From the obtained results, it is important for students to notice that the behavior of each method was consistent with the information presented in Table 1; however, the main limitation in this work was that none of the methods was able to always discriminate, or even to discriminate similarly, across all subjects.
Through the description of the AI techniques and the analysis of the results of applying them over the proposed BCI test bench, this work allows students to learn the basic theory of SOM, LDA, ANN-BP, SVM and RBM, as well as giving guidelines for applying those techniques to real-life BCI problems. Furthermore, teachers beginning to work under the Education 4.0 paradigm can use this work as introductory material for BCI and artificial intelligence, and the proposed test bench can serve as a reinforcement exercise or project to test students' understanding after a BCI or AI lesson.
Notwithstanding the contribution of this work to the development of updated curricula for the Education 4.0 framework, there is still a lot of work to do. To allow students to achieve a better comprehension of the presented methods, it is important to build an improved test bench with different band sizes, as in [37], or to use some other type of filter or dimensionality reduction method, such as Common Spatial Patterns [69] (see Appendix B), which is also part of the current state of the art in BCI. Additionally, more advanced artificial intelligence techniques for BCI classification can be explored to provide students with information about algorithms that are applied not only to BCI but also to other areas of Industry 4.0.

Data Availability Statement: Not applicable; the study does not report any data.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Reduction of EOG Artifacts over EEG
One of the main sources of noise in EEG is EOG noise, which comes from the retinal dipole and from eyelid movements; both create a potential shift on the surface of the scalp [70]. A common procedure to remove these artifacts is the regression described in [71], which consists of multiplying the three recorded EOG spatial components (horizontal, vertical and radial) by specific weighting coefficients and subtracting them from the noisy signal. For this, it is assumed that the recorded signal is described by:

Y(t) = S(t) + b_ver n_ver(t) + b_hor n_hor(t) + b_rad n_rad(t),

with S the non-contaminated EEG signal, Y the recorded EEG channel at time t, N the noise sources (the vertical, horizontal and radial recorded EOG channels n_ver, n_hor, n_rad) and b the weighting coefficients (b_ver, b_hor, b_rad) of the EOG artifacts at the EEG channel. To recover the real signal S, the noise sources have to be recorded and the weighting coefficients must be estimated. To calculate b, it is assumed that the noise source N (i.e., EOG) and the signal S (i.e., EEG) are independent, so that <N^T S> = 0, and b can then be calculated as:

b = C_NN^(−1) C_NY,

with C_NN the auto-covariance matrix of the EOG channels and C_NY the cross-covariance matrix between the EEG and EOG channels. In particular, the three monopolar electrodes for EOG are mounted on the face, as shown in Figure 10, from which two bipolar EOG signals can be derived (i.e., horizontal and vertical EOG activity). To obtain a better approximation of the weighting coefficients b, as a standard procedure, a specific set of EOG recordings is made at the start of each BCI session, in which the subjects are asked to perform eye blinks, roll the eyes clockwise and counterclockwise and move the eyes upwards and downwards (Table A1). These movements are done to cover the whole field of view without moving the head.

(1) Perform idling eye movements with eyes open and closed for a minute each.
(3) Perform eye movements (rolling, left/right and up/down) for 15 s each. These movements should circumscribe the whole field of view without moving the head.
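The regression above can be demonstrated end-to-end with synthetic data. This is an illustrative sketch only; the signals, coefficient values and sample count are invented for the demonstration.

```python
import numpy as np

# Synthetic demo of the regression in this appendix: the weights are
# b = C_NN^-1 C_NY, and the cleaned EEG is S = Y - b^T N.
rng = np.random.default_rng(0)
T = 5000
S = rng.normal(0, 1, T)                  # "true" EEG, unknown in practice
N = rng.normal(0, 1, (3, T))             # ver/hor/rad EOG channels
b_true = np.array([0.8, -0.5, 0.3])      # invented contamination weights
Y = S + b_true @ N                       # contaminated EEG recording

C_NN = N @ N.T / T       # auto-covariance of the EOG channels
C_NY = N @ Y / T         # cross-covariance between EOG and EEG
b = np.linalg.solve(C_NN, C_NY)
S_hat = Y - b @ N        # artifact-reduced EEG, close to S
```

Because S and N are independent here, the estimated b converges to the true contamination weights as the recording gets longer, which is exactly why the dedicated EOG runs of Table A1 are recorded at the start of each session.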

Appendix B. Common Spatial Patterns
Common Spatial Patterns (CSP) is a technique used to learn spatial filters for brain signal analysis; it was introduced by Muller [72] for movement-related EEG and was later proved useful for imaginary hand movement by H. Ramoser [69].
CSP's goal is to design a spatial filter that finds the optimal variance for discrimination; put differently, it applies a linear transformation that maps the input so that the variance difference between the two classes is maximal. Although CSP can only be applied to a binary problem, it has already been extended [73,74] by combining various binary spatial filters, which reduce the multi-class problem to several binary decisions. The algorithm of CSP can be seen in Algorithm A1.

Algorithm A1: Common Spatial Patterns
1: Input: trial matrices X_a and X_b for the two classes
2: Calculate the normalized covariance matrix R_a = X_a X_a^T / trace(X_a X_a^T)
3: Calculate the normalized covariance matrix R_b = X_b X_b^T / trace(X_b X_b^T)
4: Calculate the composite covariance matrix R = R_a + R_b
5: Decompose R = U λ U^T and build the whitening matrix P = λ^(−1/2) U^T
6: Transform the average covariance matrices S_a = P R_a P^T and S_b = P R_b P^T
7: Using S_a and S_b, obtain the generalized eigenvector matrix B
8: Obtain the projection matrix W = B^T P
9: Output: return the matrix W
10: Transform the samples using the projection Z_a = W X_a and Z_b = W X_b
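Algorithm A1 can be sketched in a few lines of NumPy. This is an illustrative implementation under two stated assumptions: the class covariances are averaged over a list of trials (standard practice, though the algorithm text shows a single trial matrix per class), and since S_a and S_b share eigenvectors after whitening, the eigenvectors of S_a alone are used as B.

```python
import numpy as np

def csp(trials_a, trials_b):
    """CSP following Algorithm A1: trials_* are lists of channels x samples
    arrays; returns the projection matrix W."""
    def avg_cov(trials):
        # Trace-normalized covariance, averaged over trials.
        return np.mean([X @ X.T / np.trace(X @ X.T) for X in trials], axis=0)
    Ra, Rb = avg_cov(trials_a), avg_cov(trials_b)
    R = Ra + Rb
    # Whitening transform P from the eigendecomposition of R.
    lam, U = np.linalg.eigh(R)
    P = np.diag(1.0 / np.sqrt(lam)) @ U.T
    Sa = P @ Ra @ P.T
    # Sa and Sb = P Rb P^T share eigenvectors, so eigh(Sa) suffices for B.
    _, B = np.linalg.eigh(Sa)
    return B.T @ P

# Toy data: class a has high variance on channel 0, class b on channel 1.
rng = np.random.default_rng(0)
ta = [rng.normal(0, [3, 1], (200, 2)).T for _ in range(20)]
tb = [rng.normal(0, [1, 3], (200, 2)).T for _ in range(20)]
W = csp(ta, tb)
Za, Zb = W @ ta[0], W @ tb[0]
# The first/last rows of W maximally separate the two class variances.
```

After projection, the variance of each CSP component is large for one class and small for the other, which is what makes the filtered signals easy to classify with simple methods such as LDA.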

Appendix C. Database and Code
The database can be found at http://www.bbci.de/competition/iv/#datasets (accessed on 29 July 2021) under Data set 2b.