Decoding Electroencephalography Signal Response by Stacking Ensemble Learning and Adaptive Differential Evolution

Electroencephalography (EEG) is an exam widely adopted to monitor cerebral activity in response to external stimuli, and its signals compose a nonlinear dynamical system. There are many difficulties associated with EEG analysis. For example, noise can originate from different sources, such as muscular or other physiological activity; artifacts arise from undesirable signals captured during EEG recordings; and nonlinearities occur due to brain activity and the interactions among different brain regions. All these characteristics make data modeling a difficult task. Therefore, a combined approach can be the best solution to obtain an efficient model for identifying neural data and developing reliable predictions. This paper proposes a new hybrid framework combining stacked generalization (STACK) ensemble learning and a differential-evolution-based algorithm called Adaptive Differential Evolution with an Optional External Archive (JADE) to perform nonlinear system identification. In the proposed framework, five base learners, namely, eXtreme Gradient Boosting, a Gaussian Process, Least Absolute Shrinkage and Selection Operator, a Multilayer Perceptron Neural Network, and Support Vector Regression with a radial basis function kernel, are trained. The predictions from all these base learners compose STACK's layer-0 and are adopted as inputs of the Cubist model, whose hyperparameters were obtained by JADE. The model was evaluated for decoding the electroencephalography signal response to wrist joint perturbations. The variance accounted for (VAF), root-mean-squared error (RMSE), and Friedman statistical test were used to validate the performance of the proposed model and compare its results with other methods in the literature, including the base learners.
The JADE-STACK model outperforms the other models in terms of accuracy, explaining, on average across all participants, around 94.50% and 67.50% (standard deviations of 1.53 and 7.44, respectively) of the data variability for one-step-ahead and three-step-ahead predictions, which makes it a suitable approach to nonlinear system identification. Also, the improvement over state-of-the-art methods ranges from 0.6% to 161% for one step ahead and reaches 43.34% for three steps ahead. Therefore, the developed model can be viewed as an alternative and complementary approach to well-established techniques for nonlinear system identification, since it achieves satisfactory results in explaining the data variability.


Introduction
The monitoring of neural responses when a subject is exposed to a stimulus during medical procedures is commonly conducted through a non-invasive medical technique called electroencephalography (EEG). In this respect, the collected data compose a nonlinear system of observations, whose identification is challenging due to the high degree of nonlinearity. The most common methods for nonlinear system identification are the Nonlinear AutoRegressive Moving Average model with eXogenous inputs (NARMAX) [1,2], Volterra series [3], the Hammerstein model [4], the Wiener model [5], bilinear models [6], and state-space models [7]. Some aspects associated with these models, such as structure selection and determination, parameter estimation, and interpretability, can be characterized as challenges and disadvantages [8]. Machine learning, in turn, is a promising alternative [9].
Several contributions have been made concerning system identification for the human brain-signal-movement interface, with varying results. Vlaar et al. [10] verified that a linear model explains only about 10% of the EEG data variance in the same dataset as the one used in this study, leaving more than 80% as nonlinearities. A truncated Volterra series was used to model the data in this case. In the work of Tian et al. [11], a similar experiment was conducted on six individuals by recording the EEG response to wrist manipulation. A NARMAX model based on a hierarchical neural network was used, with a pre-processing step using independent component analysis. The proposed model achieved significantly better performance than a polynomial NARMAX model. Zhou et al. [4] studied the response of the human ankle, namely, its angle, to an electrical stimulus. The authors proposed a Hammerstein model based on artificial neural networks to predict the ankle angle. In this model, the parameters are estimated using a genetic algorithm. In Gu et al.'s study [12], a NARMAX model was used to create a dynamic model of cortical responses to mechanical wrist perturbations with one-step-ahead and three-step-ahead predictions.
Aljamaan [5] used a Wiener Box-Jenkins model in conjunction with a prediction error method for system identification in EEG signals as a response to wrist manipulation. The model performed one-step-ahead predictions. Van de Ruit [13] performed experiments to measure the time-varying joint impedance of the wrist in response to perturbations in position. The authors used kernel-based regression and skirt-decomposition-based system identification. Westwick et al. [14] used data from the electromyogram (EMG) response to a rotational stimulus in the foot by evaluating the stretch reflex. The authors employed a separable least-squares optimization method to estimate the parameters of a Hammerstein cascade model, usually employed to model biological systems. Also, in relation to decoding motor tasks, Nicolas-Alonso et al. [15] used EEG data to classify the motor task that a test subject performs. The problem is divided into binary and multiclass classifications. Different pre-processing techniques, such as band-pass filtering, spatial filtering, and mutual-information-based best-individual feature selection, are used. The classification uses an ensemble of stacked regularized linear discriminant analysis models. In the work of Dalhoumi and Montmain [16], a new method to calibrate models to classify users' motor tasks is proposed, where multiple base learners are trained using data from previous subjects. The base learners are then weighted to form new predictions based on a few data points from the new user, reducing the calibration time.
Li et al. [17] used transfer learning to decode low-sampling-frequency surface EMG in a dataset of 53 different gestures of 10 subjects. Using a low sampling frequency allows for applying low-cost EMG sensors that can be used in robotic prosthesis control. In Lee et al.'s study [18], the weighted ensemble learning of convolutional neural networks (CNN) was used on a dataset consisting of nine types of arm movements from 12 subjects. The base CNNs were trained on data from different sessions but shared a joint weighted loss function in the training process. Therefore, it was possible to generalize the results, leveraging data from all sessions. Fagg et al. [19] suggested decoding arm movements in monkeys with electrode implants in the primary motor cortex area responsible for arm movement. The model consists of multiple signal-filtering methods and a linear filter decoder or a Wiener filter to predict the movement.
Considering the previous related works, there are few studies on the use of artificial intelligence approaches to deal with nonlinear system identification. In this respect, there is an open question: how do ensemble learning methods perform compared with the current state of the art, such as NARMAX structures and other machine-learning models? In this paper, we seek to answer this question. An ensemble learning approach for system identification can be considered within the machine-learning field, such as the approach developed by Li et al. [20]. This methodology increases model accuracy by combining the predictions of several base learners, typically through the weighted average rule in regression problems [21]. The goal is for each model to learn some of the data patterns so that, when their predictions are aggregated, an effective model is obtained [22][23][24]. A particular ensemble learning approach is stacked generalization (STACK). The STACK ensemble learning model executes the training process in layers and has produced accurate results, considerably lowering error values in regression tasks in different fields of knowledge [25]. In this approach, base learners/models are trained in the first layer (layer-0, the set of base learners), and their predictions are used as inputs in the subsequent layer [26]. Then, a meta-learner is trained to obtain the system's final prediction. This model is able to deal with data nonlinearities since the layer-0 models provide the meta-learner with different information about the predicted data [27].
When aiming to increase the accuracy of a model, the hybridization of two or more methodologies is adequate, such as ensemble learning modeling combined with optimization (in general, evolutionary algorithms or swarm intelligence approaches). Usually, the first is used for data modeling, and the second is used to select the ensemble's hyperparameters. Considering the use of metaheuristic optimization methods to tune hyperparameters, the main advantage is that an objective function is optimized, allowing suitable hyperparameters to be found and an adequate model to be obtained, as can be observed in [28]. In this respect, differential evolution (DE), proposed by Storn and Price [29], has already been employed in several fields and has achieved success in optimization tasks [30]. In this context, the Adaptive Differential Evolution with Optional External Archive (JADE) algorithm [31,32] should be highlighted. The optimization process becomes even more general when an adaptive mechanism is used for the crossover and mutation operations, with no need to infer optimal values for those parameters, since they are incorporated into the optimization process. In the stacking ensemble learning model, the final performance depends on the meta-learner, and a suitable choice of its hyperparameters can lead to better results, which is where optimization methods can be adopted.
Metaheuristics have been used with promising results for optimizing machine-learning model hyperparameters [33][34][35]. The gradient-free approach provided by JADE and other evolutionary algorithms makes them particularly well suited for this type of task. However, in most cases, these methods require initial settings that interfere with the optimization process. To overcome this, we used JADE in this study because it is a self-adaptive version of DE that does not require defining parameters such as the crossover and mutation probabilities. Other self-adaptive methods could also be adopted for this purpose, since our focus is not on the performance of JADE specifically but on how the accuracy of the stacking ensemble learning method can be improved when an optimizer is used to tune the meta-learner hyperparameters for the system identification performed in this study. Additionally, ensemble learning methods evolved with evolutionary algorithms can be employed to obtain efficient predictive models [36].
Therefore, the objective of this paper lies in the proposition of a new hybrid framework combining the STACK ensemble and the JADE algorithm to identify a benchmark system related to cortical responses evoked by wrist joint manipulation [10,37] and to make one-step-ahead predictions. The adopted benchmark has a single input and a single output, and feature engineering was performed to improve the performance of the developed model. Since diversity plays a key role in ensemble modeling, a set of diversified models, varying in their structures, was used to compose STACK's layer-0. The chosen models were eXtreme Gradient Boosting (XGBoost) [38] with a linear booster (a boosting-based ensemble model), a Gaussian Process (GP) [39] with a radial basis function kernel (a statistical model), Least Absolute Shrinkage and Selection Operator (LASSO) [40] (a linear model with regularization), a Multilayer Perceptron (MLP) Neural Network [41] with one hidden layer, and Support Vector Regression (SVR) [42] with a radial basis function kernel (a model based on support vectors). All the models come from different families with distinctive characteristics, ensuring the much-needed diversity of the base learners.
In training, 6-fold cross-validation (CV) was used for the base learners and the meta-learner. A grid search was adopted to define the layer-0 models' hyperparameters. The Cubist regression-rule-based model [43] was used as the meta-learner, and JADE was employed to obtain its hyperparameters.
This study mainly makes the following contributions: (i) A novel combination named JADE-STACK is proposed and applied to identify EEG signals; (ii) the results obtained with the proposed approach are compared with those of the layer-0 models, and the performance of the proposed model is compared, using the variance accounted for (VAF) and root-mean-squared error (RMSE), with the results presented in the literature for the studied problem in one- and three-step-ahead EEG predictions; and (iii) a discussion of the feasibility of the proposed STACK for nonlinear system identification is presented.
The remainder of this paper is structured as follows. Section 2 introduces the benchmark adopted. Section 3 details the methods explored in this paper. Section 4 describes the proposed methodology. Section 5 presents the results and discussion. Finally, Section 6 concludes the paper and suggests future research.

Description of the Benchmark
The dataset used in this study is the first version of the benchmark proposed by [10], available at http://www.nonlinearbenchmark.org/#EEG (accessed on 29 July 2023), and refers to the response in the human cortex to robotic manipulation of the wrist joint. There were ten healthy volunteers (age range 22-25 years old, all right-handed). Participants were seated with their right forearm fixed to the arm support and their hand strapped to the handle of a robotic manipulator, as illustrated in Figure 1a, adapted from Vlaar et al. [10]. Perturbation was applied to the wrist by a robotic manipulator and is considered the input of the dataset.
The cortical activity was recorded using EEG. A headpiece with 126 electrodes arranged according to the 10-5 system [44] was used to measure the scalp potential. All signals were sampled at 2048 Hz with 136 channels. The signal acquisition process is further described in [37]. EEG data were pre-processed with a high-pass filter, independent component analysis was performed, and line noise at 50 Hz, as well as frequencies of 100 Hz and higher, was removed in the frequency domain via the discrete Fourier transform. The signal was then re-sampled at 256 Hz (N = 256 samples). Further detailed pre-processing is found in [10]. For this system, the independent component of the EEG with the highest signal-to-noise ratio is used as the output at time t (y(t), t = 0, . . . , 1 s), and the handle angle when the wrist is stimulated is used as the system input at time t (u(t), t = 0, . . . , 1 s). Seven multisine perturbation signals were generated, and Figure 1b shows a one-second excerpt of one of them. Each participant was subjected to 49 trials of 36 s each. Each trial used three of the seven possible perturbations, cycling three by three throughout the 49 trials. Of the seven perturbation signals generated and applied for every participant, the first six were used as the training set, and the remaining one was used as the test (validation) set. From every 36 s trial, 6 s was removed, as highlighted in Figure 1c, to account for possible transient effects, leaving 30 s to be utilized.

Methods
This section presents the main aspects of the methods proposed in this paper. First, we present the STACK approach, followed by a description of the layers composing the final ensemble learning model.

Stacked Generalization
Ensemble learning, a methodology utilized in classification, regression, clustering, and nonlinear identification problems, aims to acquire a precise model [45]. This technique follows the divide-and-conquer principle by combining multiple base learners (weak learners) to construct an effective model, typically using the average rule for regression problems [22].
An alternative approach to ensemble learning is STACK-based ensemble learning, which enhances accuracy by integrating diverse base learners and utilizing multiple layers. Initially, base learners are trained in the first layer (layer-0), and their predictions are then utilized in subsequent layers. Subsequently, a meta-learner (layer-1) is trained using the predictions from the previous layer as inputs to generate final predictions. The effectiveness of this approach lies in its ability to capitalize on different models producing distinct predictions, from which the meta-learner learns.
The STACK ensemble learning method improves the forecasting output by reducing the variance in forecast errors or rectifying biases. Through this learning process, the results tend to converge toward a better solution than that obtained using only the base learners [27]. However, one drawback of this approach is the challenge of selecting the appropriate number of base learners. This issue can be addressed by employing readily available general-purpose machine-learning wrappers, such as caret for R [46], which facilitate testing numerous base learners through a unified syntax. Figure 2 illustrates the STACK ensemble learning method for regression. It is worth observing that STACK ensemble learning is an efficient tool for predictive tasks in several fields, such as financial [47], epidemiological [48], agribusiness [25], seismic [49], electrical power system [50], wind power [51], and photovoltaic solar energy [52] forecasting.
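As a concrete illustration, the layer-0/layer-1 pipeline described above can be sketched in Python with scikit-learn (the paper's own results were produced with the caret package in R); the base learners, hyperparameters, and data below are illustrative placeholders, not the configuration of this study:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear regression data standing in for the EEG regressors
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(300, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(300)

# Layer-0: diverse base learners; layer-1: a tree model standing in for Cubist
stack = StackingRegressor(
    estimators=[
        ("lasso", Lasso(alpha=0.01)),
        ("svr", SVR(kernel="rbf")),
        ("mlp", MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
    ],
    final_estimator=DecisionTreeRegressor(max_depth=3, random_state=0),
    cv=6,  # out-of-fold predictions via 6-fold CV, as in the paper
)
stack.fit(X, y)
print("train R^2:", round(stack.score(X, y), 3))
```

The key design point is that the meta-learner is fit on out-of-fold predictions of the base learners, so it learns how to combine them without seeing their in-sample fits.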

Models Used in STACK Methodology
This subsection introduces the base learners GP, SVR, LASSO, MLP, and XGBoost that compose STACK's layer-0 and Cubist as the meta-learner.

• A GP consists of a collection of random variables that follow a Gaussian distribution and are completely characterized by their mean and covariance (kernel) functions [39]. The method in this paper adopts a GP with a radial basis function kernel, whose sigma value is the sole hyperparameter.
• SVR involves identifying support vectors (points) near a hyperplane that maximizes the margin between two classes of points, determined by the difference between the target value and a threshold [42]. SVR incorporates a kernel, a function used to assess the similarity between two observations, enabling it to handle nonlinear problems. The radial basis function kernel is employed in the model described in this paper, and its sigma value is the sole hyperparameter.
• LASSO uses regularization terms to improve the model's accuracy by sacrificing some bias to reduce the variance. Additionally, when predictors are correlated, LASSO selects the best subset of them [40]. The hyperparameter of this model is the regularization value.
• The MLP is a feedforward neural network composed of one input layer (the system's inputs), one or more hidden layers, and one output layer (the system's outputs). In contrast to artificial neural networks without hidden layers, the MLP can solve nonlinearly separable problems [41]. In this paper, the MLP has a single hidden layer with a sigmoid activation function, while the output activation function is the identity. With this configuration, the only hyperparameter is the number of hidden units.
• XGBoost is based on a gradient boosting ensemble learning model and uses an additive learning strategy, which adds the best individual model to the current forecast model in the i-th prediction. A complexity term in the objective function controls the models' complexity and helps smooth the final learned weights to prevent overfitting. Regularization parameters are used to avoid the estimation variance related to the base learner and to shrink the weights toward zero [38]. The hyperparameters of this model are the number of boosting iterations, the L1 and L2 regularization terms, and the learning rate.
• Cubist is a rule-based model that operates on the regression tree principle [43]. Each regression tree is built with rules associated with the information contained in each leaf, and the final prediction is obtained as a linear combination of the rules. The concepts of committees (several Cubist results aggregated by averaging) and the neighborhood are used to build an accurate model. The hyperparameters of this model are the numbers of committees and neighbors.
A 6-fold cross-validation (CV) grid search was adopted to find the base learners' hyperparameters. This approach evaluates all possible combinations of hyperparameters through an exhaustive search and determines the most suitable set of values. A grid search is appropriate for this stage since there are several base learners, and using optimization techniques for all of them would make the process computationally expensive. Since there is a single meta-learner, the JADE approach is used to find the Cubist hyperparameters.
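For illustration, such a CV grid search for one base learner can be sketched with scikit-learn; the grid values and data are hypothetical, and note that scikit-learn parameterizes the radial basis function kernel by gamma rather than sigma:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Synthetic data standing in for one participant's regressors
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
y = X[:, 0] ** 3 + 0.05 * rng.standard_normal(200)

# Exhaustive search over a small, illustrative hyperparameter grid
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"gamma": [0.1, 1.0, 10.0], "C": [0.1, 1.0, 10.0]},
    cv=KFold(n_splits=6, shuffle=True, random_state=0),  # 6-fold CV
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
print("best hyperparameters:", grid.best_params_)
```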

Adaptive Differential Evolution with Optional External Archive
Different DE-based approaches have been shown to be capable of achieving good results in various optimization tasks, as presented in [53], which incorporates a differential-evolution-based search into the Downhill Simplex method. The JADE algorithm improves DE optimization by implementing a new mutation strategy named "DE/current-to-pbest/1" with an optional external archive and by updating the control parameters in an adaptive manner [31,32].
According to [31], the step-by-step process employed in the JADE approach is described as follows:
• Initialization: The first population is randomly initialized according to

x_j,i,0 = LB_j + rand_j(0, 1) · (UB_j − LB_j),

where i = 1, . . . , NP, j = 1, . . . , d, LB_j and UB_j are the lower and upper boundaries of the search space for parameter j, rand_j(0, 1) denotes a sample from a uniform distribution within [0, 1], x denotes a candidate solution (or agent) in the population, and NP is the size of the population. After that, the algorithm enters a loop of evolutionary operations: mutation, crossover, and selection.
• Mutation: Considering that the mutation strategies of classical DE are fast but their convergence can be less reliable, the JADE approach adopts the "DE/current-to-pbest/1" mutation. With P as the set of current solutions and A as the archive of inferior (replaced) solutions, the mutation vector v_i,g can be written as

v_i,g = x_i,g + F_i (x_pbest,g − x_i,g) + F_i (x_r1,g − x̃_r2,g),

where x_i,g and x_r1,g are selected from P, x_pbest,g is randomly chosen from the best 100p% solutions in P, and x̃_r2,g is randomly selected from P ∪ A. Moreover, the mutation factor F_i associated with x_i is regenerated adaptively in each generation (iteration) as

F_i = randc_i(µ_F, 0.1),

where randc_i(µ_F, 0.1) refers to a Cauchy distribution with a location parameter of µ_F and a scale parameter of 0.1. If F_i is greater than or equal to 1, it is truncated to 1; if F_i is less than or equal to 0, it is regenerated. The set S_F contains all the successful mutation factors in generation g. The initial value of the location parameter is set to 0.5, and at the end of each generation it is updated as

µ_F = (1 − c) · µ_F + c · mean_L(S_F),

where mean_L is the Lehmer mean and c is a constant between 0 and 1. According to [54], the purpose of the archive is to counteract the greedy nature of the current-to-pbest mutation, preventing premature convergence.
• Crossover: In the crossover step, a binomial operation forms the trial/offspring vector of individual i in generation g, u_i,g = {u_1,i,g, u_2,i,g, . . . , u_d,i,g}, where

u_j,i,g = v_j,i,g if rand_j(0, 1) ≤ CR_i or j = j_rand; otherwise, u_j,i,g = x_j,i,g,

where rand_j(0, 1) is a uniform number chosen randomly between these boundaries, and j_rand is an integer randomly chosen from 1 to d, newly generated for each i. In this respect, CR_i is the crossover probability of each individual x_i and is independently generated according to

CR_i = randn_i(µ_CR, 0.1),

where randn_i represents a normal distribution with a mean of µ_CR and a standard deviation of 0.1. The values generated from this distribution are truncated to the range [0, 1]. To adaptively update the crossover probability, the set S_CR is introduced, which consists of the successful crossover probabilities CR_i in generation g. In the first generation, the mean is initialized to 0.5. Then, in each subsequent generation, the mean is updated as

µ_CR = (1 − c) · µ_CR + c · mean_A(S_CR),

where mean_A is the usual arithmetic mean.
• Selection: Finally, the new population is obtained by evaluating the trial vectors u_i,g with the cost function. If the value of the cost function for the offspring is better than that for the parent x_i,g, the new element x_i,g+1 will be the trial vector; otherwise, it will be equal to the parent.
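The loop described above (initialization, "DE/current-to-pbest/1" mutation with an external archive, binomial crossover, selection, and the adaptive updates of µ_F and µ_CR) can be condensed into the following illustrative Python sketch; refinements of the full algorithm, such as index-collision checks among r1, r2, and i, are omitted:

```python
import numpy as np

def jade(fobj, lb, ub, dim=5, pop_size=30, gens=100, p=0.1, c=0.1, seed=1):
    rng = np.random.default_rng(seed)
    pop = lb + rng.random((pop_size, dim)) * (ub - lb)  # random initialization
    fit = np.array([fobj(x) for x in pop])
    mu_f, mu_cr, archive = 0.5, 0.5, []
    for _ in range(gens):
        s_f, s_cr = [], []
        order = np.argsort(fit)                 # ranking for pbest selection
        n_best = max(1, int(p * pop_size))
        for i in range(pop_size):
            # Adaptive parameters: F ~ Cauchy(mu_f, 0.1), regenerated if <= 0
            # and truncated to 1; CR ~ N(mu_cr, 0.1), truncated to [0, 1]
            F = 0.0
            while F <= 0.0:
                F = mu_f + 0.1 * np.tan(np.pi * (rng.random() - 0.5))
            F = min(F, 1.0)
            CR = float(np.clip(rng.normal(mu_cr, 0.1), 0.0, 1.0))
            # Mutation: DE/current-to-pbest/1 with archive
            x_pbest = pop[order[rng.integers(n_best)]]
            r1 = rng.integers(pop_size)
            union = np.vstack([pop] + archive) if archive else pop
            r2 = rng.integers(len(union))
            v = pop[i] + F * (x_pbest - pop[i]) + F * (pop[r1] - union[r2])
            # Binomial crossover
            j_rand = rng.integers(dim)
            mask = rng.random(dim) <= CR
            mask[j_rand] = True
            u = np.clip(np.where(mask, v, pop[i]), lb, ub)
            # Selection: replaced parents go to the archive; successful F/CR kept
            fu = fobj(u)
            if fu < fit[i]:
                archive.append(pop[i].copy())
                pop[i], fit[i] = u, fu
                s_f.append(F)
                s_cr.append(CR)
        if len(archive) > pop_size:             # keep the archive bounded
            keep = rng.choice(len(archive), pop_size, replace=False)
            archive = [archive[k] for k in keep]
        if s_f:                                 # adaptive updates of mu_f, mu_cr
            mu_f = (1 - c) * mu_f + c * (np.sum(np.square(s_f)) / np.sum(s_f))
            mu_cr = (1 - c) * mu_cr + c * float(np.mean(s_cr))
    best = int(np.argmin(fit))
    return pop[best], float(fit[best])

# Minimize the sphere function over [-5, 5]^5
x_best, f_best = jade(lambda x: float(np.sum(x ** 2)), lb=-5.0, ub=5.0)
print(x_best.shape, f_best)
```

Note the Lehmer mean of S_F, sum(F²)/sum(F), used in the µ_F update, versus the arithmetic mean used for µ_CR.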

JADE-STACK Algorithmic Description
The main steps adopted for the identification are described as follows:
1. Initially, six of the seven perturbation signals available for each participant are used as the training set, and every participant's seventh signal is used as the test set. This split follows the initial evaluation of the dataset reported in [10,11] so that the results can be compared. Moreover, considering the first six perturbation signals, a six-fold CV is used to train and validate the proposed model. System identification is performed for each participant individually, one step ahead, according to the proposal of [10], and three steps ahead, as suggested in [11]. Several features are derived from the input lagged by one instant, u(t − 1): the mean, standard deviation, and skewness over a window of three observations; the difference between the inputs lagged by one and two instants; the cube [u(t − 1)]^3; and the logarithm of u(t − 1).
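A sketch of this feature construction, assuming pandas-style rolling windows (the exact window alignment used in the paper may differ):

```python
import numpy as np
import pandas as pd

# Synthetic perturbation input, kept positive so the logarithm is defined
rng = np.random.default_rng(0)
u = pd.Series(rng.uniform(0.5, 1.5, size=256))

u1 = u.shift(1)  # u(t-1)
u2 = u.shift(2)  # u(t-2)
features = pd.DataFrame({
    "mean3": u1.rolling(3).mean(),   # rolling mean over 3 observations
    "std3": u1.rolling(3).std(),     # rolling standard deviation
    "skew3": u1.rolling(3).skew(),   # rolling skewness
    "diff12": u1 - u2,               # difference of lags 1 and 2
    "cube": u1 ** 3,                 # [u(t-1)]^3
    "log": np.log(u1),               # logarithm of u(t-1)
}).dropna()                          # drop rows lost to lagging/windowing
print(features.shape)
```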

2. In sequence, the models described in Section 3.2 are trained according to the structure

y(t) = f(y(t − 1), . . . , y(t − n_y), u(t − 1), . . . , u(t − n_u), u_F(t − 1)) + ε(t),

where f is a function used to map the data, y(t) and u(t) are the output and input of the system at time t (t = 1, . . . , 256), n_y = 5 is the maximum lag for outputs, n_u = 3 is the maximum lag for inputs, u_F(t − 1) (F = 1, . . . , 6) are the features based on u(t − 1) described in step 1, and ε(t) is the residual. A six-fold CV is adopted in this process, and the set of hyperparameters of each model is defined using a grid search. The obtained predictions compose STACK's layer-0.
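The regressor structure of this step can be illustrated by building the lagged design matrix explicitly; the signals below are synthetic, and the engineered features of step 1 are omitted for brevity:

```python
import numpy as np

def lag_matrix(y, u, n_y=5, n_u=3):
    """Rows [y(t-1..t-n_y), u(t-1..t-n_u)] and targets y(t)."""
    start = max(n_y, n_u)
    rows, target = [], []
    for t in range(start, len(y)):
        rows.append(np.concatenate([y[t - n_y:t][::-1], u[t - n_u:t][::-1]]))
        target.append(y[t])
    return np.asarray(rows), np.asarray(target)

# Synthetic input/output pair (256 samples, as in one benchmark period)
rng = np.random.default_rng(0)
u = rng.standard_normal(256)
y = np.convolve(u, [0.5, 0.3, 0.1], mode="same")  # toy linear output

X, t = lag_matrix(y, u)
print(X.shape, t.shape)  # 5 output lags + 3 input lags per row
```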

3. In the meta-learner training, the JADE approach is used to find the Cubist hyperparameters by maximizing the variance accounted for (VAF),

VAF = (1 − var(y(t) − ŷ(t)) / var(y(t))) × 100%,

where var(·) is the variance, ŷ(t) is the predicted output, and y(t) is the observed output. The Cubist hyperparameters are the integer-valued numbers of committees and neighbors; during the optimization process, real values are rounded to the next integer value. The search space is defined as 0 to 60 for committees and 0 to 20 for neighbors. Additionally, 10 iterations (generations) and a population size of 10 are used in the JADE optimization. Once the hyperparameters are obtained, the meta-learner is trained again using a six-fold CV, and the final predictions are obtained. Moreover, the RMSE is also computed as

RMSE = sqrt((1/n) · Σ_{i=1}^{n} (y_i − ŷ_i)^2),

where y_i and ŷ_i are the i-th observed and forecast values, respectively, and n is the size of the time series.
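Both criteria are straightforward to compute; a minimal sketch with toy values:

```python
import numpy as np

def vaf(y, y_hat):
    """Variance accounted for, in percent; 100 means a perfect fit."""
    return (1.0 - np.var(y - y_hat) / np.var(y)) * 100.0

def rmse(y, y_hat):
    """Root-mean-squared error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(round(vaf(y, y_hat), 2), round(rmse(y, y_hat), 3))  # → 98.0 0.158
```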
4. Finally, the results obtained in steps 2 and 3 are used to assess the statistical differences between the compared models using the variance-accounted-for (VAF) criterion. The Friedman test [55,56] can be employed to determine whether a set of k models (k > 2) exhibits statistically significant differences in its results. For n observations and k groups, the test statistic FD is calculated as

FD = (12 / (n · k · (k + 1))) · Σ_{j=1}^{k} R_j^2 − 3 · n · (k + 1),

where R_j is the rank sum of group j; FD follows a chi-squared distribution with k − 1 degrees of freedom. The null hypothesis assumes that no difference exists between the results of the different groups. If the null hypothesis is rejected, a post hoc test is necessary to identify which groups have distinct results. In this case, the Nemenyi multiple-comparison test can be employed, which determines the critical difference (CD), defined as

CD = q_{∞,k,α} · sqrt(k · (k + 1) / (6 · n)),

where CD represents the critical difference required to determine whether there is a significant difference between the measures of the groups, q_{∞,k,α} corresponds to the studentized range statistic, k represents the number of algorithms, α signifies the significance level, and n denotes the number of samples. If the absolute difference in average ranks |R̄_i − R̄_j| is greater than CD, there is a significant difference between the results of algorithms i and j [57]. Thus, the Friedman and the Nemenyi multiple-comparison tests are applied to compare the errors obtained from all the proposed algorithms.

Figure 4 presents the framework of the proposed optimized STACK. The results presented in Section 5 were generated using the caret package [46] of R software version 4.3.1 [58]. Table 1 shows the hyperparameters selected for each adopted model. Additionally, when the DE method was used to tune the meta-learner hyperparameters, the population size, crossover ratio, and mutation factor were set to 100, 0.5, and 0.2, respectively.
For the genetic algorithm (GA) method, the population size, mutation probability, and crossover probability were set to 100, 0.3, and 0.8, respectively. In the same vein, for Particle Swarm Optimization (PSO), the swarm size, particle velocity, cognitive coefficient, social coefficient, and inertia weight were set to 100, 2, 2, 2, and 0.729, respectively. Furthermore, Table 2 presents the calculation times of the training process of the proposed approach and the base models. It is important to note that although the average processing time of the JADE-STACK model appears to be longer, the model is not designed for real-time learning. It requires approximately 20 min to train, but once trained, it can be used for as long as it maintains accuracy. Additionally, the prediction step for the target dataset is almost instantaneous. Therefore, the training process should not pose any issues or limitations for its implementation in a production system.
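As an illustrative sketch of the statistical procedure in step 4, SciPy provides the Friedman statistic, and the Nemenyi critical difference follows from the studentized range quantile (the error matrix below is synthetic, and a large degrees-of-freedom value approximates q_{∞,k,α}):

```python
import numpy as np
from scipy import stats

# Toy error matrix: n = 10 participants (rows) x k = 3 models (columns);
# the constant offsets make the third model noticeably worse.
rng = np.random.default_rng(0)
errors = rng.random((10, 3)) + np.array([0.0, 0.1, 0.4])

# Friedman test: each column is one model's errors across participants
fd, p_value = stats.friedmanchisquare(*errors.T)

# Nemenyi critical difference, CD = q * sqrt(k(k+1)/(6n)), where q is the
# studentized range statistic divided by sqrt(2)
n, k = errors.shape
q = stats.studentized_range.ppf(0.95, k, 1000) / np.sqrt(2)  # large df ~ infinity
cd = q * np.sqrt(k * (k + 1) / (6.0 * n))
mean_ranks = stats.rankdata(errors, axis=1).mean(axis=0)
print(f"FD = {fd:.2f}, p = {p_value:.4f}, CD = {cd:.3f}", mean_ranks)
```

Pairs of models whose average ranks differ by more than CD are declared significantly different.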

Results and Discussion
Tables 3 and 4 present the VAF and RMSE for ten participants in the forecasting horizon of one and three steps ahead, respectively, as proposed in [11]. Additionally, the average and standard deviation (Std) of the proposed model are compared with the results from the literature for the same dataset.
For the VAF comparison in one-step-ahead predictions, the optimized STACK improves the VAF by 1.51%, 8.48%, 2.85%, 5.50%, and 7.22% compared to LASSO, GP, MLP, SVR, and XGBoost, respectively. For the VAF in three-step-ahead predictions, the proposed model improves the explained variability by 1.22%, 13.66%, 11.64%, 9.43%, and 15.04% compared to LASSO, GP, MLP, SVR, and XGBoost, respectively. Similar results are achieved for the RMSE criterion. These results indicate that the proposed approach learns the data behavior better than the other models. Additionally, for short-term forecasting, the proposed model outperforms the other compared methods in terms of accuracy, since it better explains the data variance. As pointed out by [12], multi-step-ahead predictions are still challenging, especially for cortical activity. The recursive strategy used to perform multi-step-ahead predictions can be well suited; however, it carries the errors from previous steps into the following predictions, making every step more susceptible to error. This explains why the three-step-ahead predictions had a smaller VAF and a higher RMSE and shows how challenging these predictions can be for EEG signals.
The proposed framework achieves better accuracy than the results obtained by [10,11] for one-step-ahead prediction with the NARMAX-Hierarchical Neural Network (NARMAX-HNN) and Volterra series. JADE-STACK identifies the data with greater accuracy and fewer parameters: in [11], NARMAX-HNN has 19 parameters and the Volterra series has 46, while the model proposed in this paper has 11 hyperparameters for both one step ahead and three steps ahead. Furthermore, considering the NARMAX-Polynomial (NARMAX-P), the proposed JADE-STACK is competitive and achieves better accuracy regarding VAF, since a similar proportion of variability is explained with fewer parameters than the 24 needed for NARMAX-P. The improvement in VAF achieved by the proposed JADE-STACK for one step ahead is 2.4%, 0.6%, 120.6%, and 161% compared with the VAF of NARMAX-HNN, NARMAX-P, Volterra 1, and Volterra 2, respectively. For three steps ahead, the improvement in VAF obtained with JADE-STACK compared with NARMAX-P is 43.34%, although the VAF of NARMAX-HNN is 2.67% better than that of the proposed method; thus, under the VAF criterion, the proposed model outperforms NARMAX-P and is similar to NARMAX-HNN at this horizon. Compared with these approaches, the proposed model seems more suitable for identifying a system composed of cortical activities since only 11 parameters are used (9 for the base learners and 2 for the meta-learner). An important challenge with these data is the nonlinearities therein. As stated by [10], a simple linear model could explain only about 10% of the data variance, leaving the remainder as nonlinearities. The machine-learning models, combined with the chosen features, achieve VAF values significantly higher than 10%, as shown in Tables 3 and 4.
When the proposed model is compared with the DE-STACK model, it is possible to observe that the JADE optimizer leads to better results by finding more suitable hyperparameters for the meta-learner. Indeed, the improvement achieved by the proposed model is 1.70% and 2.86% for one- and three-step-ahead predictions, respectively. Similar results are obtained when comparing JADE-STACK with GA-STACK and PSO-STACK. These results can be attributed to the fact that JADE tunes its control parameters during the evolutionary process, whereas the DE, GA, and PSO parameters (crossover and mutation probabilities) are tuned by trial and error. Figures 5-14 compare observed and predicted values for all ten participants, as well as the prediction residuals (differences between observed and predicted values), on the test set (trial seven) for the proposed JADE-STACK model when predicting one step and three steps ahead. The horizons were based on the benchmark problem so that the results could be compared with the models previously proposed for the benchmark detailed in this paper. As pointed out by Tian et al. [11], dynamic brain activity occurs on the order of milliseconds, and three steps ahead represent 12 ms ahead for EEG oscillations; therefore, this horizon is suitable for testing the proposed and compared frameworks. Moreover, further studies could introduce the concept of free-run simulation for EEG prediction, as pointed out by Ayala and Coelho [59]. While the proposed model does not offer significant improvements for three-step-ahead predictions compared to the existing literature, the results are at least similar in terms of accuracy and consistency. However, most machine-learning problems do not have a clear all-purpose model to solve them.
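The layer-0/layer-1 arrangement of the stacking ensemble can be sketched as follows. This is a minimal illustration, not the paper's implementation: only two base learners are shown, the data are synthetic, the 6-fold out-of-fold split merely mirrors the paper's six-fold CV, and, since Cubist is an R model without a standard Python counterpart, a decision-tree regressor stands in as the meta-learner.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear regression data (stand-in for the EEG features/targets).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

base_learners = [Lasso(alpha=0.01), SVR(kernel="rbf", C=10.0)]

# Layer-0: out-of-fold predictions of each base learner become the
# meta-features, avoiding leakage of training targets into layer-1 inputs.
Z = np.column_stack([cross_val_predict(m, X, y, cv=6) for m in base_learners])

# Refit each base learner on the full training set for later prediction.
for m in base_learners:
    m.fit(X, y)

# Layer-1 meta-learner (Cubist in the paper; a tree regressor stands in here).
meta = DecisionTreeRegressor(max_depth=4, random_state=0).fit(Z, y)

def stack_predict(X_new):
    """Stacked prediction: base-learner outputs feed the meta-learner."""
    Z_new = np.column_stack([m.predict(X_new) for m in base_learners])
    return meta.predict(Z_new)
```

In the JADE-STACK framework, the meta-learner's hyperparameters would additionally be tuned by the optimizer to maximize VAF on the validation folds.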
As stated by the No Free Lunch theorem [60,61], the engineer has the task of testing, identifying, and selecting the best tool for the problem at hand. Therefore, we can consider the evaluation of the proposed model valid. Indeed, the presented results refer to the performance of JADE-STACK on trial seven after performing a six-fold CV for each participant over the first six trials. In most cases, the proposed framework captured the data variability and learned the data behavior, showing a similar pattern between observed and predicted values. According to the Kolmogorov-Smirnov test (K = 0.0322, p-value = 0.7558), the residuals can be considered white noise.
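A residual-whiteness check of this kind can be sketched with SciPy as follows. This is an illustrative sketch on synthetic residuals, not the paper's data: the standardized residuals are compared against a standard normal distribution, and a large p-value means the Gaussian-residual hypothesis cannot be rejected.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for model residuals (illustrative only).
rng = np.random.default_rng(42)
residuals = rng.normal(loc=0.0, scale=0.5, size=500)

# Standardize, then run a one-sample Kolmogorov-Smirnov test against N(0, 1).
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_stat, p_value = stats.kstest(z, "norm")

# Interpretation: p_value > 0.05 -> no evidence against Gaussian residuals,
# consistent with treating them as white noise.
```

A complementary check for whiteness would inspect the residual autocorrelation function, since the K-S test alone addresses only the distributional shape.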

According to the Friedman test, there is a statistically significant difference between the VAF values of the proposed and compared models across all participants for predicting one step ahead (χ²₁₀ = 93.29, p-value = 1.22 × 10⁻¹⁵) and three steps ahead (χ²₈ = 42.98, p-value = 8.83 × 10⁻⁷). Figures 15 and 16 depict the critical difference (CD) plots based on the Nemenyi test for the one-step-ahead and three-steps-ahead horizons, respectively. When a line does not join two algorithms, there is a difference between their evaluated measures. When compared with the other models in the literature, the proposed model achieves the best results for one-step-ahead predictions and is second-best for three-step-ahead predictions. Even though the errors of JADE-STACK and some models are statistically equivalent, the results obtained using the models are not identical; minor differences exist and should be considered [62].
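The Friedman test across participants can be reproduced in outline with SciPy. The sketch below uses hypothetical VAF scores for three illustrative models over ten participants, not the paper's measurements.

```python
from scipy.stats import friedmanchisquare

# Hypothetical VAF scores (%) for ten participants under three models
# (values are illustrative, not the paper's measurements).
jade_stack = [95.1, 94.8, 93.9, 96.0, 94.2, 95.5, 93.7, 94.9, 95.3, 94.0]
model_b    = [93.0, 92.5, 91.8, 93.9, 92.1, 93.4, 91.6, 92.8, 93.2, 91.9]
model_c    = [88.2, 87.5, 86.9, 89.1, 87.3, 88.6, 86.7, 88.0, 88.4, 87.1]

# H0: all models perform equally across participants (a paired, rank-based
# test). Reject when p < 0.05, then apply a post hoc procedure such as the
# Nemenyi test for the pairwise comparisons shown in the CD plots.
chi2, p = friedmanchisquare(jade_stack, model_b, model_c)
```

Because the test operates on within-participant ranks, it is robust to between-participant differences in overall VAF level.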

Conclusions and Future Research
This paper proposes a novel combination of machine-learning and non-parametric regression models to build a STACK model for identifying a system related to cortical responses extracted through EEG. JADE optimization was employed to tune the Cubist hyperparameters by maximizing the VAF. The proposed ensemble uses LASSO, GP, MLP, SVR, and XGBoost as STACK's base learners and Cubist as the meta-learner. We also compared the performance of the proposed model with that of the base learners and with results reported in the literature [10,11] for the analyzed benchmark in the context of one-step-ahead and three-step-ahead predictions of EEG signals.
The proposed optimized STACK outperforms the base learners and most of the methods from the literature proposed by [10,11], as evaluated by VAF. These results show that hybridizing the STACK ensemble methodology with metaheuristics is a suitable tool for system identification, since the obtained model can adequately capture data variability. We showed that our proposed STACK model is appropriate for this task: by using the predictions of the layer-0 models, the meta-learner learns the data behavior, accommodates data nonlinearities, and obtains predictions similar to the observed values. It is also crucial to notice that even single models, such as LASSO, performed well in terms of VAF when compared with other models in the literature, which shows that even simpler machine-learning regression techniques can be appropriate for system identification.
Despite the proposed framework achieving satisfactory results in terms of VAF, some limitations of the developed model can be identified. First, with the stacking ensemble approach, the selection of base and meta-models is challenging, since there are no guidelines regarding the number of base models to use or how many layers to apply. Second, whether there is a better choice of meta-model remains an open question. Third, feature selection is an essential step of system identification; however, despite being an essential step of data modeling, it was not the focus of this study.
The following future research is intended: (i) evaluate the proposed model using a free-run simulation [59]; (ii) adopt multi-objective optimization for hyperparameter tuning and feature selection using principal component analysis and independent component analysis; (iii) compare the proposed machine-learning model with variants of high-order autoregressive models; and (iv) evaluate the use of L-SHADE [63] (Success-History-based Adaptive DE algorithm with Linear Population Size Reduction) and EPSDE [64] (DE Algorithm with Ensemble of Parameters and Mutation and Crossover Strategies), as well as swarm-based methods such as the chaotic particle swarm optimizer in [65], the particle swarm with a quasi-Newton local search used in [66], owl optimization [67], falcon optimization [68], the Jaya optimization algorithm [69], and cheetah optimization [70]. These variants have shown promising results in different areas and could also be applied to system identification and hyperparameter tuning. Indeed, this proposed future research will help develop a more complete and effective model for the adopted dataset, with the aim of building an enhanced ensemble learning model for system identification.
In particular, future research will compare high-order autoregressive (AR) models with machine-learning models for different step-ahead prediction horizons.