Bayesian Optimized Echo State Network Applied to Short-Term Load Forecasting

: Load forecasting impacts directly financial returns and information in electrical systems planning. A promising approach to load forecasting is the Echo State Network (ESN), a recurrent neural network for the processing of temporal dependencies. The low computational cost and powerful performance of ESN make it widely used in a range of applications including forecasting tasks and nonlinear modeling. This paper presents a Bayesian optimization algorithm (BOA) of ESN hyperparameters in load forecasting with its main contributions including helping the selection of optimization algorithms for tuning ESN to solve real-world forecasting problems, as well as the evaluation of the performance of Bayesian optimization with different acquisition function settings. For this purpose, the ESN hyperparameters were set as variables to be optimized. Then, the adopted BOA employs a probabilist model using Gaussian process to find the best set of ESN hyperparameters using three different options of acquisition function and a surrogate utility function. Finally, the optimized hyperparameters are used by the ESN for predictions. Two datasets have been used to test the effectiveness of the proposed forecasting ESN model using BOA approaches, one from Poland and another from Brazil. The results of optimization statistics, convergence curves, execution time profile, and the hyperparameters’ best solution frequencies indicate that each problem requires a different setting for the BOA. Simulation results are promising in terms of short-term load forecasting quality and low error predictions may be achieved, given the correct options settings are used. Furthermore, since there is not an optimal global optimization solution known for real-world problems, correlations among certain values of hyperparameters are useful to guide the selection of such a solution.


Introduction
To supply enough energy, operate, and maintain a power system efficiency as well as to trade energy profitably, it is necessary to know how much power will be demanded-that is, it is important to forecast the power system load, a task that is traditionally trusted to the experience of statisticians, engineers and experts from electricity industries. However, the growing introduction of renewable distributed energy, for both sources and consumers, has significantly changed the load profiles, so that traditional approaches are not effective anymore.
With the development of artificial neural networks, several forecasting models based on ANNs are proposed to enhance the forecasting accuracy of nonlinear time series-for example, the Echo State Network (ESN), which is a type of reservoir for recurrent neural network trained through the learning approach of reservoir computing. The original ESN concept used randomly fixed created reservoirs, and this concept is considered to be one of the main advantages of this effective dynamic neural network model. One advantage of ESN compared to other feed-forward and recurrent ANNs are that only the connections between the reservoir and the output layer need to be trained, thereby reducing the computational complexity for the training phase and making it less likely to get trapped in local optima. Also, ESNs are one of the most well-known types of recurrent neural networks because of their excellent performance when nonlinear dynamic system modeling. It is capable to learn the complex nonlinear behaviors of dynamical systems and these have been successfully applied to a wide range of engineering problems, including time series forecasting [49][50][51] and load forecasting [30,[52][53][54][55][56] problems.
ESN uses an interconnected recurrent grid of processing neurons called a dynamical reservoir to replace the hidden layer of classical recurrent neural networks. The basic idea of the ESN is a supervised learning scheme to transform the low dimensional temporal input into a higher dimensional echo state, and then train the output connection weights to make the system output the desired information. However, the use of ESN requires initially the definition of a set of parameters, and this may be done through a trial and error approach or optimization of hyperparameters. In the case of hyperparameter optimization, it is carried out chiefly through metaheuristics, including evolutionary computation and swarm intelligence algorithms. In [52], the Bayesian optimization of Gaussian process hyperparameters in reservoir computing is proposed and the results over two benchmarks, the prediction of a chaotic laser system and a Nonlinear Auto Regressive Moving Average (NARMA) model show that the method is robust. The types of the reservoir considered in their experiments were the dynamic nonlinear delay nodes and the ESN. A variation in the Bayesian optimization of ESNs is proposed in [53].
Like most machine learning algorithms, ESNs possess several hyperparameters of the dynamic reservoir that have to be carefully tuned to achieve a better performance. The ESN approach adopted in this paper has seven hyperparameters: spectral radius, internal units, input scaling, input shift, teacher scaling, teacher shift, and feedback scaling.
Bayesian optimization algorithms (BOAs) applied to echo state networks are found in the literature with two different goals: parameter optimization and hyperparameter optimization. The works in [57][58][59][60] employ the BOA for training the ESN parameters (read-out weights), while [61] employs the BOA for tuning ESN hyperparameters. However, in [61], there is no concern for or investigations about the selection of the acquisition function set in the BOA and experiments are based on theoretical benchmarks. Like [61], the main contribution of this paper is the investigation of BOA based on the Gaussian process to find suitable hyperparameters of ESN, but applied to short-term load forecasting problems and with additional contributions as follows: (a) the validation of the surrogate utility function known as acquisition function using three approaches, which are expected improvement, lower confidence bound and the probability of improvement. The mentioned approaches were validated in BOA design to reduce forecasting errors and improve the generalization of ESN. (b) A comparison of the proposed ESN model based on BOA with three support vector regression approaches; ESN models achieve better forecasting performance in terms of mean square error minimization on two real-world case studies of short-term load forecasting, one from Poland and another from Brazil. Short-term load forecasting, the forecasting of load ranging from one hour to one week ahead, plays an essential role in the safe and stable operation of the electric grid and has always been a vital research issue for energy management. In this perspective, it could be useful to improve management efficiency and reduce the grid operating cost in power systems.
The remaining parts of this study are organized as follows. Section 2 gives a theoretical background for ESN and the BOA. Followed by an explanation of the proposed ESN approach, the findings are presented in Section 3 and discussed in Section 4. Finally, Section 5 presents the paper's conclusions and future research directions.

Background
This section gives an overview of ESNs, including their basic theory and hyperparameters. Moreover, the fundamentals of BOA are presented.

Echo State Networks
Recurrent Neural Networks (RNNs) are ANNs with recurrent weights-that is, neuron outputs from forward layers may be weighed as inputs to neurons from backward layers, creating a sort of internal memory in the network, making this a non-linear dynamic system (which is an advantage), subject to difficulties in the learning process, likely leading to problems than may be without or slower convergence [54] (which are disadvantages).
Reservoir computing has emerged in recent years as an alternative to gradient descent methods for training RNNs. They have a "reservoir" of dynamics that can be easily tapped to process complex temporal data. ESN is a type of RNN, part of the reservoir computing framework that has a sparse reservoir and a simple linear output.
An ESN, as illustrated in Figure 1, uses a least-squares based learning algorithm in a supervised training form to modify only the output weights based on the inputs and reservoir weights previously randomly assigned, first proposed by Herbert Jaeger in 2001 and later corrected in [62]. The ESN is composed of input units u, internal units f , a recurrent layer called reservoir with fixed sparse hidden-to-hidden connections, and output units f out . The output of internal units is called the state x ( Figure 1a).
Network weights are divided into four categories, they are input weights W in (Figure 1b), reservoir weights W (Figure 1c,d), output weights W out (Figure 1e), and feedback weights from the output W back (Figure 1f).
The input weights are randomly initialized from a uniform distribution within a specified range between −1 and 1. The reservoir weights are fixed during the training stage, and only the output weights need to be adapted. The hidden layer state vector named "reservoir" is composed of fully connected nonlinear neurons. The reservoir outputs are named echo states when the ESN satisfies the echo state property (ESP).
The states of the reservoir are updated according to x(n + 1) = f W in ·u(n + 1) + W·x(n) + W back ·y(n) where n is a time series sample, x, u and y are vectors of reservoir states, inputs and outputs respectively, f are the activation functions of the reservoir neurons. The output signal y is described by The training process is realized in the sense of least squares minimization as in where T is the desired output vector, also known as the target. The matrix inversion is usually made through the Moore-Penrose operator or the pseudo-inverse operator.
The main features that distinguish ESN from other RNNs are the fact that (a) the hidden layer of the ESN is a sparsely connected reservoir and (b) only the connection weights between the reservoir and the system output need to be trained, where often only a linear regression problem needs to be solved to finish the training process.
The training process is realized in the sense of least squares minimization as in where is the desired output vector, also known as the target. The matrix inversion is usually made through the Moore-Penrose operator or the pseudo-inverse operator.
The main features that distinguish ESN from other RNNs are the fact that (a) the hidden layer of the ESN is a sparsely connected reservoir and (b) only the connection weights between the reservoir and the system output need to be trained, where often only a linear regression problem needs to be solved to finish the training process. the input unit number one, the internal unit number three, the state of internal unit number two, and the output unit number one.
The echo state property investigated by [62] states that, given two different initial states and a time sequence of inputs, the resulting states converge to similar values independent of the initial states of the reservoir, implying that, after a transient initial time, the states are an echo of the history of the inputs. A sufficient condition to the echo state property existence is that the highest absolute eigenvalue of the matrix-spectral radius-must be lower than unity.

Bayesian Optimization Algorithm
The BOA is an efficient framework for the global optimization of black-box functions without using gradient information, formalizing the optimization problem by learning a distribution of The echo state property investigated by [62] states that, given two different initial states and a time sequence of inputs, the resulting states converge to similar values independent of the initial states of the reservoir, implying that, after a transient initial time, the states are an echo of the history of the inputs. A sufficient condition to the echo state property existence is that the highest absolute eigenvalue of the W matrix-spectral radius-must be lower than unity.

Bayesian Optimization Algorithm
The BOA is an efficient framework for the global optimization of black-box functions without using gradient information, formalizing the optimization problem by learning a distribution of functions consistent with our data. It employs a probabilistic model to capture the unknown function and a Gaussian process is often a popular choice as a probabilistic model.
The principle of the BOA is to iteratively estimate a Gaussian process model, specified by its mean and covariance functions of the cost function f (X) and the corresponding value of X that minimizes it. Assume initially that the vector of decision variables to be optimized is X ∈ R D from which some sample points of decision variables are randomly selected (e.g., uniform, log-scaled, among others) and named X i , then their corresponding costs Y i are calculated to compose the set of input-output pairs {X i , Y i }. This set is then used to estimate a Gaussian process model, which is a probabilistic model given by where P is the model output represented as a probability of Y given X i and Y i -more precisely, the posterior probability Q of Y given the Gaussian process composed of base functions h, its coefficients vector β, the prior probability model given by its mean µ and the covariance kernel function k with parameters θ, and finally by the noise standard deviation, σ 2 .
A Gaussian process defines a prior distribution over the universe of possible functions which can be transformed into a posterior distribution over functions, given only a finite set of points. The fitting of the Gaussian process model estimates β, θ and σ 2 , with the prior mean µ usually set to zero. After estimating the Gaussian process model, the BOA uses a surrogate utility function known as acquisition function to update X i . A common choice to the acquisition function is the expected improvement (EI) maximization of evaluating the objective at each potential new point, which is given by where EI(X, Q) is the expected improvement as a function of decision variables X and posterior probabilities of the estimated Gaussian process Q, x best is the location of the lowest posterior mean, µ Q (x best ) is the mean of Q at x best and f is the cost function.
The key attribute responsible for the triumph of BOA approaches is that evaluating the acquisition function depending on the predictive distribution of the Gaussian process is inexpensive compared to evaluating the black box objective.
The maximum EI, (i.e., x best ) is found by sampling thousands of x from X and evaluating µ Q at those points, the best candidates are selected and then x best is found through local search. The stopping criterion is given by the number of iterations through which the aforementioned steps are executed.
Alternatively, the acquisition functions may be of the "lower confidence bound" and "probability of improvement" types. The probability of improvement (PI) is the probability that a new x lead to a lower objective function value f (x) modified by a margin m and is given by where PI is the probability of improvement, P Q is the posterior probability, X is the set of decision variables, Q is the modelled Gaussian process, µ Q is the mean of Q and x best is the value in X related to the lower µ Q and m is the noise standard deviation. The "lower confidence bound" type of acquisition function is the maximization of a curve two standard deviations below the posterior mean at each point, as given in Energies 2020, 13, 2390 where LCB is the lower confidence bound, X is the set of decision variables, µ Q is the posterior mean and σ Q is the posterior standard deviation.

Proposed BOA Optimized ESN for Load Forecasting
This section presents the problem formulation, the datasets, the performance metric used, and the proposed method for load forecasting.

ESN Hyperparameters
Hyperparameters are crucial to many machine learning algorithms. Conventional methods like grid search and random search can be expensive when tuning the hyperparameters of an ESN model. Recently, BOA approaches have become popular in the machine learning community as efficient tools for tuning hyperparameters. In other words, BOA can be useful to model-based approaches for automatically configuring hyperparameters to generate a surrogate model of some unknown function that would otherwise be too expensive to query in full.
Our proposal consists in evaluate Bayesian optimized ESNs for load forecasting tasks. However, hyperparameters tuning of the reservoir for ESN have a great influence on the performance of the network, so suitable values must be set for them. The training of an ESN requires the specification of basically seven hyperparameters, named: spectral radius, size of the reservoir (number of internal units), input scaling, input shift, teacher scaling, teacher shift and also feedback scaling. In our proposal, these hyperparameters are approached as decision variables to be optimized and the activation function adopted is the sigmoid function.
The hyperparameters of the ESN are optimized with BOA, each with a different kind of acquisition function for sake of comparison, which is (i) the expected improvement (EI), (ii) the probability of improvement (PI) and (iii) the lower confidence bound (LCB).
The ESNs are set with one input and one output, where the input is a sequence of load history (time series) and the output is a sequence of load predictions one-step ahead. Hence, the problem is stated as one-step ahead forecasting-that is, the forecasting horizon, also known as the prediction interval, is set to one. Data pre-processing consists of dividing by the maximum value for normalization and division into training, validation and test sets. The single feature considered is the load history one hour behind, since we expect that the echo state property reminds us of the recent history of loads that entered the reservoir.
The search space of hyperparameters are given in Table 1. The spectral radius is a positive value and must be lower than one to guarantee the echo state property, the number of internal units was empirically limited to 100 due to computational burden (the continuous value is rounded before being used), the input/teacher scales and shifts give freedom to optimize the respective normalization and operation points, and finally, the feedback scaling determines the weights of the output to the reservoir.
The cost function receives, as arguments, a set of hyperparameters, the training input, the output, the validation input, and the validation output, generates the ESN architecture, trains it with the control hyperparameter settings and returns the validation error. This process is repeated for 50 iterations for each run, with a total of 50 independent runs.

Datasets
Fortunately, electricity load data are abundant and, most of the time, public. However, to compare the performance of a proposed method it is necessary to know, beyond the data, the reported results from other publications, the approach adopted for data division in training and test sets, the forecasting horizon and the metric used for results evaluation.
The load forecasting problem may be formulated as a univariate or multivariate time series forecasting problem. Univariate problems use previous values of the time series as predictors, while multivariate problems use additional features such as weather and calendar data. Additional features may improve the performance of the prediction; however, care must be taken with collinearity and the extra effort necessary to predict weather features. For example, if a model uses temperature as an input, the evolution of predictions over time will need to feed back the predicted values as well as the unknown future temperature values that will be forecasted. In this study, the univariate load forecasting approach is adopted.
We adopt two datasets, the first is a load time series in GW (gigawatts) from Poland, also used by [14] and can be downloaded from [63], which is composed of 1601 samples, 1400 for training and 201 for testing. Since ESN explores the sequential feature of time series, cross-validation is not possible, so the validation set to be used as the cost during the optimization process is composed of the last 200 samples from the training set in a hold-out scheme.
The second dataset is from Brazil and there are no results to be used as a benchmark, so we also intend to contribute by generating a benchmark for these data, and to achieve that we followed the same strategy to deal with data as in [2]. The data selected are the hourly loads in MW from the first four weeks from the south region in Brazil; the first three weeks are for training and the test is performed over the fourth week, the validation set is drawn from the training and corresponds to the third week, also in a hold-out scheme. The Brazilian dataset can be downloaded from [64].
The dataset descriptions are exhibited in Table 2 and Figure 2. The size of the Polish time series is higher than the Brazilian dataset, but both dataset measures are regularly sampled at hourly intervals. The mean values of both datasets are quite close to the median, showing that the values are not skewed. The autocorrelation function bar charts indicate that there is a significative correlation among many past values and the current value of the time series. Considering each significative lag as an input would lead to a lot of inputs in the model and hence it would possibly cause overfitting due to the curse of dimensionality. ESNs, as any RNN, are appropriate for predicting time series because they can deal with those lags implicitly in the hidden layer, called a reservoir, through an intrinsic memory and as so they do not require each lag as a different input. including the mean squared error (MSE), the mean absolute error (MAE), the mean absolute percentage error (MAPE). In this paper, test results are evaluated in terms of MSE given by where N is the number of evaluation points,ŷ is the forecasted value and y is the actual value. If the model structure and parameter estimates obtained by the ESN are statistically valid, the system model residual should be uncorrelated with all linear and nonlinear combinations of past inputs and outputs. For nonlinear systems, under some mild assumptions, this can be tested by computing the following correlation tests proposed by [65] are calculated as where δ(·) is the Kronecker delta function u 2 (t) = (u(t)) 2 − u 2 , (ξu) = ξ(t + 1)u(t + 1), a is the mean of a and ϕ ab is the normalized cross-correlation given by In practice, if these correlation functions in Equation (9) fall within the confidence intervals at a given significance level, the model is regarded as adequate and acceptable.
where δ(•) is the Kronecker delta function ( ) ( ) = ( ) − , ( ) = ( + 1) ( + 1), is the mean of and is the normalized cross-correlation given by In practice, if these correlation functions in Equation (9) fall within the confidence intervals at a given significance level, the model is regarded as adequate and acceptable.  Figure 3 presents a detailed flowchart of the proposed system. Suppose we have an initialized ESN architecture and an initial hyperparameter set. After supervised learning with training inputs and outputs, a trained ESN is achieved. Then, its predictions, given the validation set inputs, are compared with true validation outputs using the MSE, named the validation MSE. The BOA, intended to minimize the validation MSE, iteratively adjusts the hyperparameters until an optimal   Figure 3 presents a detailed flowchart of the proposed system. Suppose we have an initialized ESN architecture and an initial hyperparameter set. After supervised learning with training inputs and outputs, a trained ESN is achieved. Then, its predictions, given the validation set inputs, are compared with true validation outputs using the MSE, named the validation MSE. The BOA, intended to minimize the validation MSE, iteratively adjusts the hyperparameters until an optimal set of hyperparameters is found. Next, an ESN is tuned with the optimal hyperparameters and produces predictions, given the test inputs. Those predictions are compared with true test set outputs using the MSE, named the test MSE or MSE values for the test set.

General View of the Proposed Forecasting Approach
Energies 2020, 13, x FOR PEER REVIEW 9 of 19

Results
The optimization is performed for both datasets for 50 runs and the convergence curves of mean objective values and best objective values are analyzed for each type of acquisition function considered in the study. The MSE values of the test sets are compared (prediction versus actual data plots and correlation tests) and the best solutions are presented. The convergence analysis uses the MSE values for the validation set, while statistics descriptions and correlation tests use the MSE values for the test set.
In terms of BOA, the maximum number of iterations had to be arbitrarily set as a stopping criterion. To choose the maximum number of iterations, a single random initialization has been analyzed, increasing the maximum number of iterations in steps of 10 until any improvement had been significantly made, which has been observed after approximately 30 iterations. Due to the possibility of other initialization requirements of more iterations until convergence, this number has been arbitrarily extended to 50 for the BOA in this paper. Since the solution of the problem is unknown, as is the case for most real-world problems, it is not possible to ensure that the probability of converging to local has been eliminated. However, multiple runs of the optimization algorithm from different starting points in the search space are expected to reduce the bias due to starting conditions. On the other side, guessing how many starting points should be enough to eliminate such bias is uncertain, and most of the time is limited by the computational effort required for each optimization run. Fifty runs are often employed in the literature regarding the comparison of algorithms in global optimization tasks.
The main performance index considered is the MSE measure over the test sets, as presented in Tables 3 and 4

Results and Discussion
This section presents the results as well as their discussion.

Results
The optimization is performed for both datasets for 50 runs and the convergence curves of mean objective values and best objective values are analyzed for each type of acquisition function considered in the study. The MSE values of the test sets are compared (prediction versus actual data plots and correlation tests) and the best solutions are presented. The convergence analysis uses the MSE values for the validation set, while statistics descriptions and correlation tests use the MSE values for the test set.
In terms of BOA, the maximum number of iterations had to be arbitrarily set as a stopping criterion. To choose the maximum number of iterations, a single random initialization has been analyzed, increasing the maximum number of iterations in steps of 10 until any improvement had been significantly made, which has been observed after approximately 30 iterations. Due to the possibility of other initialization requirements of more iterations until convergence, this number has been arbitrarily extended to 50 for the BOA in this paper.
Since the solution of the problem is unknown, as is the case for most real-world problems, it is not possible to ensure that the probability of converging to local has been eliminated. However, multiple runs of the optimization algorithm from different starting points in the search space are expected to reduce the bias due to starting conditions. On the other side, guessing how many starting points should be enough to eliminate such bias is uncertain, and most of the time is limited by the computational effort required for each optimization run. Fifty runs are often employed in the literature regarding the comparison of algorithms in global optimization tasks.
The main performance index considered is the MSE measure over the test sets, as presented in Tables 3 and 4 Tables 3 and 4 are graphically illustrated through violin plots in Figure 4a,b. The difference in magnitude from one dataset to another is due to the variable unit, which is in MW for Brazil and GW for Poland.
The results are evaluated in terms of statistical measures, as exhibited in Tables 3 and 4, due to the randomness characteristic of the optimization algorithm. If a single random seed is used, it could bias the results in favor of a specific model. A method to overcome such biasing is to perform several runs of optimization and evaluate the statistical measures of central tendency and dispersion. A model that results in lower mean and median MSE values over 50 runs will probably present a lower MSE value in a new random run of the optimization.
Some features about the error distributions may be noticed from analyzing the violin plots in Figure 4a,b. For both datasets, the LCB acquisition function errors are bi-modal, with the highest mode near the median and the lowest mode near the maximum. The EI acquisition function error distribution is tailored to the minimum for the Polish dataset and symmetric for the Brazilian dataset. Conversely, the PI acquisition function error distribution is symmetric for the Polish dataset and tailored to the minimum for the Brazilian dataset. All errors are more likely to lie near the median value.    The convergence curves for the best model for Polish and Brazilian datasets are presented in Figure 5a,b, respectively (i.e., LCB-BO-ESN for the Polish dataset and EI-BO-ESN for the Brazilian dataset). Each plot contains the log of the MSE for each iteration of the optimization algorithm over the test set-more specifically, the median trend surrounded by boxplots to illustrate divergences among the runs. Figure 6a,b exhibit the execution time behavior along with the iterations for an Intel i7-7500U 3.5 GHz processor and 20 GB RAM (Random Access Memory). The code has been implemented in the Matlab computational environment. The black line represents the median values while the shaded area represents the span between the maximum and minimum execution times of each iteration. The green line represents the cumulative execution time along with the iterations. Histograms of total execution times are shown in Figure 7a,b for the Polish and Brazilian datasets, respectively.
The frequency of values found as best hyperparameters are exhibited in Figures 8 and 9 for the Polish and Brazilian datasets, respectively. Each pair of hyperparameters is compared against each other and color bars indicate the most frequency optimized values. The predictions of the LCB-BO-ESN algorithm over the test set are shown in Figure 10a for the Polish dataset and of the EI-BO-ESN in Figure 10b for the Brazilian dataset, and the correlation tests of the best predictions (i.e., LCB-BO-ESN for Polish and EI-BO-ESN for Brazilian datasets) are presented in Figure 11.

Discussion
From the Polish dataset statistics (Table 3 and Figure 4a) it is noticed that the LCB-BO-ESN achieves the lower mean and median MSE over all the runs and also the minimum MSE and the lowest maximum MSE; however, PI-BO-ESN achieves the lowest standard deviation. All ESNs improved the predictions-that is, they found the lowest minimum values compared to the SVR models.
From the Brazilian dataset statistics (Table 4 and Figure 4b), it can be observed that the EI-BO-ESN achieves the lowest mean standard deviation, minimum and maximum MSE among all the runs, however, LCB-BO-ESN achieves the lowest median MSE value. These results may now serve as a benchmark for future research.
Violin plots (Figure 4) aggregate information about central tendency and dispersion measures with a density distribution curve. The variants EI and PI present a unimodal behavior of MSE values while the LCB variant presents a bi-modality. Such a bi-modality indicates that there is a prominent occurrence of convergence to local minima when LCB variant is employed, despite it present the lowest median for both datasets. However, even when presenting the lowest median for the Brazilian dataset, its higher standard deviation and skew (deviation between the median and mean), together with the inability to find an extreme minimum value, makes EI more advantageous than LCB.
All runs converge for the same approximate value with little deviation after a certain number of iterations, which is 39 for the Polish dataset ( Figure 5a) and 18 for the Brazilian dataset (Figure 5b). After the fifth iteration in the Polish dataset, a value near the minimum had already been achieved in some of the runs as shown in the lower boxplot whiskers, while in the Brazilian dataset this happened from the first iteration. Regarding the execution times, the iteration tends to take approximately a median value two seconds after an initial transient period, presenting a slightly linear increase along with the iterations (Figures 6a, 6b). This linearity in time is perceived at the median cumulative time of the runs, which terminates at 100 seconds for the Polish dataset and 90 seconds for the Brazilian dataset. Most of the runs, for both datasets, take between 80 and 90 seconds of total execution time (Figures 7a, 7b) and all of them are executed in a duration ranging from 80 seconds (minimum) to 130 seconds (maximum).   Regarding the execution times, the iteration tends to take approximately a median value two seconds after an initial transient period, presenting a slightly linear increase along with the iterations (Figure 6a,b). This linearity in time is perceived at the median cumulative time of the runs, which terminates at 100 s for the Polish dataset and 90 s for the Brazilian dataset. Most of the runs, for both datasets, take between 80 and 90 s of total execution time (Figure 7a,b) and all of them are executed in a duration ranging from 80 s (minimum) to 130 s (maximum). Regarding the execution times, the iteration tends to take approximately a median value two seconds after an initial transient period, presenting a slightly linear increase along with the iterations (Figures 6a, 6b). This linearity in time is perceived at the median cumulative time of the runs, which terminates at 100 seconds for the Polish dataset and 90 seconds for the Brazilian dataset. Most of the runs, for both datasets, take between 80 and 90 seconds of total execution time (Figures 7a, 7b) and all of them are executed in a duration ranging from 80 seconds (minimum) to 130 seconds (maximum).    Dealing with optimization on real-word datasets does not bring a known minimum value for comparison, so the best hyperparameters are grouped in pairs to get hints of its best tuning values and relationships. The Polish dataset best hyperparameters (Figure 8) suggests that an SR near one is the best choice considering an ESN with approximately 80 internal units, Tsc near one, Isc near one, Tsh near-zero or −0.5, and Ish near 0.2; however, it would be improved if it reaches around 0.7 with an Fb of 0.6. For example, improved results are also achieved with N = 60 and Isc = 0.4, as well as with Fb = 0.6 and Ish = −0.2.
In terms of the Brazilian dataset, the best hyperparameters (Figure 9), the best choices are lower N with SR = 0.  Figures 10a,10b show that the best predictions follow the nonlinear pattern of the load; in the Polish dataset, it is possible to notice the highest errors between the 60th and 100th hours, where there is a change in the mean value, and for the Brazilian dataset the highest errors are noticed during the first ten hours; however, the predictions follow the actual data. Discrepancies between predicted and actual data are due to the difficult-to-capture non-linear behavior of the time series that have not been taken into account by the model. The prediction model is trained and validated using the previous values of the time series and will generalize over these unknown values with a non-linear relationship to the previous values.     Figure 10a,b show that the best predictions follow the nonlinear pattern of the load; in the Polish dataset, it is possible to notice the highest errors between the 60th and 100th hours, where there is a change in the mean value, and for the Brazilian dataset the highest errors are noticed during the first ten hours; however, the predictions follow the actual data. Discrepancies between predicted and actual data are due to the difficult-to-capture non-linear behavior of the time series that have not been taken into account by the model. The prediction model is trained and validated using the previous values of the time series and will generalize over these unknown values with a non-linear relationship to the previous values.  It is possible to notice that the prediction for the Polish dataset complies with all the tests in Figure 11, while the predictions for the Brazilian dataset do not comply with any. This means there may be inputs (features, load history) that are not properly echoed by the optimized ESN, meaning that perhaps it requires the use of a different activation function or even the addition of extra inputs to the network. It is possible to notice that the prediction for the Polish dataset complies with all the tests in Figure 11, while the predictions for the Brazilian dataset do not comply with any. This means there may be inputs (features, load history) that are not properly echoed by the optimized ESN, meaning that perhaps it requires the use of a different activation function or even the addition of extra inputs to the network.

Conclusions and Future Research Direction
ESNs are an effective alternative to conventional RNNs due to their fast training process and good performance in dynamic system modeling. In this work, it has been proposed the BOA of ESN hyperparameters for load forecasting. Existing methods use a trial and error approach for the hyperparameters tuning or generates a large number of models and then apply a model selection algorithm. Results were compared with other published results and show that the proposed approach achieves similar performance when compared to the best of them; beyond that, a benchmark was proposed for the Brazilian dataset.
In terms of management implications, the adoption of forecasting models to short-term forecasting is useful. Management decisions can be made using reliable information. Hence, decisions taken by electricity suppliers and consumers will be more accurate as the load predictions are also more accurate. Some examples regarding the supplier side are about the allocation of resources such as hydric reservoir level, coal or oil stocks, and operational and maintenance planning. On the consumer side, the management of contracts that offer an incentive for non-peak load hours is more precise and the load distribution may be better with accurate forecasts.
For future research we intend to investigate the influence of other types of activation functions in the ESN, the compromise between the echo state property and the input addition, as well other variants of the BOA. Other optimization algorithms combined with chaotic sequences [67][68][69] could be applied to obtain the best hyperparameters. Besides, LSTM, gated recurrent unit (GRU), least squares-support vector regression (LS-SVR) and other promising machine learning models for time series forecasting must be included in a comparison with the ESN. Future research also intends to include a higher number of datasets with different characteristics, such as lower correlations among features and output, evaluations of the different numbers of iterations and stopping criteria for hyperparameter tuning, as well as the assessment of prediction performance for different prediction intervals.

Conclusions and Future Research Direction
ESNs are an effective alternative to conventional RNNs due to their fast training process and good performance in dynamic system modeling. In this work, it has been proposed the BOA of ESN hyperparameters for load forecasting. Existing methods use a trial and error approach for the hyperparameters tuning or generates a large number of models and then apply a model selection algorithm. Results were compared with other published results and show that the proposed approach achieves similar performance when compared to the best of them; beyond that, a benchmark was proposed for the Brazilian dataset.
In terms of management implications, the adoption of forecasting models to short-term forecasting is useful. Management decisions can be made using reliable information. Hence, decisions taken by electricity suppliers and consumers will be more accurate as the load predictions are also more accurate. Some examples regarding the supplier side are about the allocation of resources such as hydric reservoir level, coal or oil stocks, and operational and maintenance planning. On the consumer side, the management of contracts that offer an incentive for non-peak load hours is more precise and the load distribution may be better with accurate forecasts.
For future research we intend to investigate the influence of other types of activation functions in the ESN, the compromise between the echo state property and the input addition, as well other variants of the BOA. Other optimization algorithms combined with chaotic sequences [67][68][69] could be applied to obtain the best hyperparameters. Besides, LSTM, gated recurrent unit (GRU), least squares-support vector regression (LS-SVR) and other promising machine learning models for time series forecasting must be included in a comparison with the ESN. Future research also intends to include a higher number of datasets with different characteristics, such as lower correlations among features and output, evaluations of the different numbers of iterations and stopping criteria for hyperparameter tuning, as well as the assessment of prediction performance for different prediction intervals.